PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Adding new data sources/datasets with PEcAn #2550

Open ayushprd opened 4 years ago

ayushprd commented 4 years ago

Hi everyone, I see many New Dataset labels with various data sources that can be integrated with PEcAn. I was planning to work on a few of them as part of my GEE-PEcAn GSoC proposal. Are there any datasets that you would find particularly helpful to have added to PEcAn?

mdietze commented 4 years ago

If there are datasets that we've already tagged as of interest that are also already on GEE, then definitely add them! If you post a list of the overlap between New Dataset and GEE, we'd be happy to help prioritize.

That said, many of the datasets of interest don't live on GEE and are better handled by either different/additional automated workflows (for high-volume, standardized data) or the data ingest app (for pulling in 'long tail' data via DOI, drag-and-drop, or the APIs for generalized data repositories [e.g. DataONE]).

We're definitely interested in advancing all these tools too.
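
As a minimal sketch of how the overlap list between New Dataset and GEE could be produced: the dataset names below are hypothetical placeholders, not the actual issue labels or the real GEE catalog, and the helper names `normalize` and `dataset_overlap` are made up for illustration.

```python
def normalize(name: str) -> str:
    """Lower-case a name and drop spaces/punctuation so spelling variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def dataset_overlap(requested, available):
    """Return the requested names that also appear in `available`, in requested order."""
    available_keys = {normalize(n) for n in available}
    return [n for n in requested if normalize(n) in available_keys]


# Hypothetical placeholder lists (assumptions, not real label or catalog contents).
new_dataset_labels = ["SMAP", "GEDI", "ECOSTRESS", "Landsat 8"]
gee_catalog = ["landsat 8", "Sentinel-2", "SMAP"]

overlap = dataset_overlap(new_dataset_labels, gee_catalog)
print(overlap)
```

Normalizing names before comparing is only a convenience for mismatched spellings; a real comparison would probably need a hand-checked mapping between label names and catalog IDs.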

ashiklom commented 4 years ago

Yeah, my inclination would be to use GEE as a fallback if reasonably easy-to-use APIs for the original data aren't available elsewhere. As I recall, GEE does a lot of reprojecting/resampling under the hood to make all the remote sensing imagery line up, which ultimately blurs the line between what's real data and what's just resampled or interpolated. That works really well for many of their end-users, but the kind of model-data fusion work that we do with PEcAn may demand a higher level of care. That's not to say we shouldn't use GEE, just that it's usually worth spending a bit of time looking for alternative places to get any given dataset. In some cases, it will be easier and better to retrieve the data from GEE, in which case that's what we should do. But in other cases, it may be easier and better to get the data from a different source.

In particular, in addition to Mike's suggestions above, we should also keep an eye on the capabilities of the DAACs, which not only store the data but are also actively developing tools to make the data easier to retrieve and work with. For example:

ayushprd commented 4 years ago

Thanks, I'll try to identify the overlapping datasets. I understand that in some cases it's better to use sources like the DAACs directly instead of GEE.

chilampoon commented 4 years ago

I also noted in my data ingest app proposal that I want to include more data sources. I found an R package, nasapower, that can download NASA POWER data, and an R interface, nasadata, for accessing some of NASA's APIs. I am not sure which is the better way to go; it may need more exploration.

mdietze commented 4 years ago

@chilampoon I'm a bit wary of those two suggestions. Both contain a lot of dead links, which isn't a good sign that either is being maintained. NASA POWER appears to be a derived product focused on energy resources, not something that's high on our priority list for ingest. Taking a quick look at the nasadata package's vignette, it really reads like something written by someone who doesn't understand remote sensing (e.g. refers to Landsat 8 as 'low quality imagery'). It also appears that the access that it does provide to Landsat 8 is actually via Google Earth Engine.

This may not be a universal consensus on the PEcAn team, but I think that if you need the raw data from any specific satellite, you're now talking about sufficiently high-volume data that it makes sense to write code specific to that API (which is what I think @ashiklom was suggesting earlier). But there, the list of what data you want is pretty important! If, on the other hand, you know you want to do a lot of preprocessing steps on the cloud to reduce the data volume of your download (and avoid doing that processing on your own machines) and are OK with any sort of reprojection/interpolation that occurs on GEE, then that service makes sense. GEE also does have the advantage of providing a single interface that is pretty darn fast. Finally, while NASA is awesome, it's not the only space agency out there producing remote sensing data that we need (indeed, I think @istfer wanted the GEE interface to be able to pull Sentinel data).
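
The trade-off above can be written down as a rough decision rule. This is only an illustrative sketch: the `DataRequest` fields and the `suggest_source` function are hypothetical names, not part of PEcAn or GEE, and real choices would depend on the specific dataset.

```python
from dataclasses import dataclass


@dataclass
class DataRequest:
    """Hypothetical description of what a workflow needs from a remote sensing product."""
    needs_raw_data: bool             # raw data from one specific satellite is required
    wants_cloud_preprocessing: bool  # reduce download volume by processing on the cloud
    accepts_resampling: bool         # OK with reprojection/interpolation done by GEE


def suggest_source(req: DataRequest) -> str:
    """Return a rough suggestion following the reasoning in the comment above."""
    if req.needs_raw_data:
        # High-volume raw data: write code specific to the original provider's API.
        return "source-specific API"
    if req.wants_cloud_preprocessing and req.accepts_resampling:
        # Cloud-side reduction is wanted and resampling is acceptable.
        return "GEE"
    # Neither case clearly applies; compare the options for this dataset.
    return "case by case"


print(suggest_source(DataRequest(True, False, False)))
print(suggest_source(DataRequest(False, True, True)))
```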

p.s. In addition to @istfer's need for multispectral data (Sentinel, Landsat, etc.), my personal wishlist for remote sensing data I'd love to see getting into PEcAn is: SMAP soil moisture and vegetation optical depth, GEDI lidar data (which has a new R package: https://github.com/carlos-alberto-silva/rGEDI), SIF from OCO-2, OCO-3, etc., and ECOSTRESS thermal data. Would also love for someone to tackle improving our pipeline for ingesting NOAA GOES data (https://doi.org/10.3390/rs11212507).

ayushprd commented 4 years ago

Thanks for sharing, @mdietze. I had already planned to integrate most of these sources. I'll try to find a way to handle GOES as well.

chilampoon commented 4 years ago

@mdietze Thanks for the reminder! Would you like to import only the NOAA GOES data, or both GOES and the estimated NDVI? Also, has the model proposed in that paper already been added to PEcAn?

mdietze commented 4 years ago

The GOES diurnal model in the Wheeler paper currently lives in its own repo outside of PEcAn. Getting that full pipeline implemented and optimized, on top of all the other remote sensing listed, is beyond the scope of GSoC, but getting the initial download more automated would definitely be a helpful/important first step.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 365 days with no activity.