International-Soil-Radiocarbon-Database / ISRaD

Repository for the development and release of ISRaD data and tools
https://international-soil-radiocarbon-database.github.io/ISRaD/

How to decide/cap spatial data included in ISRaD_extra #246

Closed ShaneStoner closed 3 years ago

ShaneStoner commented 3 years ago

Hi everyone. ISRaD extra currently serves WorldClim v1.4 variables (19 in total). I think there is good agreement that these variables are broadly useful and widely applicable. In recent discussions, there has been talk of updating them to v2.1, which is fairly trivial to do and includes the same variables. Another dataset was also discussed: the TerraClim data product, which is built with WorldClim, has wider time coverage (1958-2019), and includes important variables such as soil moisture, but does not have the full suite of Bioclim variables.

The question then arises again: should we be ultra-inclusive in ISRaD extra, or cap our value-added product at broadly applicable variables (i.e. Bioclim and SoilGrids)? We could also choose individual variables (e.g. soil moisture) to include from various sources.

A final question: should we keep the Bioclim v1.4 data if we include the v2.1 data?

Thanks to @jb388 for the input thus far, but I would like to ping @coreylawrence and @aahoyt (and of course whoever else has thoughts here)

jb388 commented 3 years ago

Just one more note: with the update to Bioclim v2.1 we have the option to increase the resolution from 2.5' (arc-minutes) to 30" (arc-seconds). Should we go for it?

coreylawrence commented 3 years ago

My opinion is that we should definitely update to Bioclim v2.1 and increase the resolution.

When we make the update, we should time it with the release of a new version, thus archiving a version of ISRaD with v1.4 and then not including the v1.4 data after that. Or perhaps it would be better to archive an intermediate version that has values from both datasets, then transition to only the most up-to-date data. My main point is that I don't think we should continue to include values from older datasets in perpetuity. That would mean too many variables and too much potential for confusion about what's what.

As for inclusion of other datasets, I am a bit torn on this.

To the extent that a geospatial dataset can be downloaded, archived, and point values easily extracted, I don't think it is too much effort to add that to the ISRaD build function. However, if points extraction requires accessing remote datasets or significant post-processing (e.g. averaging of multiple annual layers to calculate a long-term mean value), then I think we start to get into the territory where maintaining that functionality may require more time/effort than we can reasonably give it.
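To illustrate the "easy" case described above, here is a minimal Python sketch of extracting a point value from a locally archived, regularly spaced lat/lon grid. It is not the ISRaD build code; the function name `extract_point`, the grid layout, and the demo values are all assumptions made for illustration.

```python
import math


def extract_point(grid, lat, lon, north, west, cell_deg):
    """Return the value of the grid cell containing (lat, lon).

    Assumed layout (hypothetical): `grid` is a list of rows ordered north
    to south, each row running west to east; `north` and `west` are the
    coordinates of the grid's upper-left corner; `cell_deg` is the cell
    size in decimal degrees. Returns None if the point is off the grid.
    """
    row = math.floor((north - lat) / cell_deg)
    col = math.floor((lon - west) / cell_deg)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Hypothetical 2 x 3 grid of 10-degree cells, upper-left corner at 20N, 0E.
demo_grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

value = extract_point(demo_grid, lat=5.0, lon=25.0, north=20.0, west=0.0, cell_deg=10.0)
```

With a single static layer like this, extraction is one index lookup per site, which is why it adds little burden to a build step.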

For example, I have built code to extract point values of MODIS NPP, PET, and ET but the process requires extracting points from multiple years of record and averaging. Even with the R-based API, only 1000 points can be extracted at one time. Thus, the process involves multiple submissions, downloads, reassembly, and processing. I think that is too much and too prone to problems to do every time we build the database. So perhaps more appropriate as a vignette.
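The batching workflow described above can be sketched roughly as follows. This is a Python illustration only, not the R code mentioned in the comment: `fetch_batch` is a hypothetical stand-in for whatever remote request returns one value per point for a given year, and the 1000-point limit is taken from the description above.

```python
def chunk(points, size=1000):
    """Split a list of points into consecutive batches of at most `size`."""
    return [points[i:i + size] for i in range(0, len(points), size)]


def long_term_mean(points, years, fetch_batch, batch_size=1000):
    """Average yearly values per point across `years`.

    `fetch_batch(batch, year)` is a hypothetical callable that returns a
    list with one numeric value per point in `batch`, in the same order.
    Missing values are not handled in this sketch.
    """
    sums = [0.0] * len(points)
    for year in years:
        offset = 0
        for batch in chunk(points, batch_size):
            values = fetch_batch(batch, year)
            # Reassemble: add each returned value to the matching point.
            for i, v in enumerate(values):
                sums[offset + i] += v
            offset += len(batch)
    return [s / len(years) for s in sums]
```

Even in this simplified form, the number of requests is (number of batches) x (number of years), and every one of them is a point where a build could fail, which is the maintenance concern raised above.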

TerraClim is probably somewhere in between Bioclim and MODIS in terms of effort. I suspect that the full dataset is too big to download and archive, but remote point extraction may be easier to manage and require less post-processing. So maybe that functionality is something we could add as a supplemental function in the R package, but since it would still require remote point extraction, not as one that is automatically built into ISRaD_Extra.

aahoyt commented 3 years ago

Upgrading to 2.1 sounds good to me. As long as all the variable names stay the same, and it's documented, I don't think we need a transition version with both. It looked like the variable names were similar, but let's confirm. I like Corey's idea of archiving the previous version/making it available somewhere on our website as well.

Higher resolution sounds great if the file sizes are manageable for frequent use.

I'm also in favor of adding more data products to ISRaD_extra if they don't further complicate the build process much (agree with Corey). Soil moisture would be great. We would have to compress their monthly resolution (maybe do min, mean, max like WorldClim bio variables?)
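As a rough illustration of the min/mean/max idea, here is a minimal Python sketch that collapses twelve monthly values for one site into three summary statistics. The function name `summarize_monthly` and the output keys are hypothetical, and the sample values are invented for the example.

```python
from statistics import mean


def summarize_monthly(monthly_values):
    """Collapse 12 monthly values for one site into min, mean, and max."""
    if len(monthly_values) != 12:
        raise ValueError("expected exactly 12 monthly values")
    return {
        "min": min(monthly_values),
        "mean": mean(monthly_values),
        "max": max(monthly_values),
    }


# Invented monthly soil moisture values for a single site.
example = summarize_monthly([float(m) for m in range(1, 13)])
```

This would turn one monthly variable into three columns per site, so the number of added variables stays small.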

ShaneStoner commented 3 years ago

Thanks everyone for your input. I generally agree with everything said here, and would advocate for updating to WorldClim 2.1 as soon as possible. Do we have criteria in mind for a "major release", like @coreylawrence mentioned? I like the idea of having a hard cutoff between v1.4 and v2.1 that corresponds to a release. For instance, the switch could coincide with the next update of the CRAN package.

I was able to get the soil moisture data (1981-2010) from TerraClim without too much trouble, and the WorldClim v2.1 files are large (10 GB) but manageable compared to the rest. Thus, I think that we could hold off on updating everything for a little while (month or two?) while we prepare for a release. It's easy to download and extract the data if others would like to use it in the meantime.

@jb388 What is the current status of an update to the CRAN package? Is there a realistic way we could release a major update (package, big data push, new geospatial data) in the near future? Or did it happen and I missed it?

jb388 commented 3 years ago

The most recent update of ISRaD on CRAN is ISRaD v1.5.6, which I submitted in September of this year.

I think it would be reasonable to submit an update in early January. Of course the actual data isn't in the R package, but we would be able to describe the new variables in the master template/info files that are built into the package. I have also fixed a few bugs and written some new functions since then, so this sounds like a good excuse to push an update in the near future.

The soil moisture and ET/PET variables would be a nice addition to ISRaD_extra, in my opinion.

Two more points/questions: 1) @ShaneStoner are you going to compute the annual averages for the 1981-2010 period from the set of monthly data? It also seems like they provide annually aggregated data for the whole 1958-2019 period, but maybe I am misunderstanding that. Seems like a bit of work to do the aggregation yourself with that much data, but at least you'd only need to do it once... 2) Returning to the resolution question, all the TerraClim data is at 2.5' and actually uses the 2.5' resolution WorldClim2 data (not the higher resolution 30" WorldClim2 data). Is there any reason to keep all the spatial products at the same resolution? I would lean towards using TerraClim at the 2.5' resolution for all of our climatic data: temp, moisture, soil moisture, PET, ET, etc., rather than a patchwork of different sources and resolutions, but I don't feel super strongly about it.
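For reference, the difference between the two resolutions discussed here can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and the figure of about 111.32 km per degree at the equator is an approximation.

```python
ARCSEC_PER_ARCMIN = 60
ARCSEC_PER_DEGREE = 3600
KM_PER_DEGREE_AT_EQUATOR = 111.32  # approximate value, assumed here


def arcmin_to_arcsec(arcmin):
    """Convert arc-minutes to arc-seconds."""
    return arcmin * ARCSEC_PER_ARCMIN


def cell_km_at_equator(arcsec):
    """Approximate cell edge length in km at the equator."""
    return arcsec / ARCSEC_PER_DEGREE * KM_PER_DEGREE_AT_EQUATOR


def cells_per_coarse_cell(coarse_arcsec, fine_arcsec):
    """Number of fine cells that fit inside one coarse cell."""
    return (coarse_arcsec / fine_arcsec) ** 2


coarse = arcmin_to_arcsec(2.5)   # 2.5' expressed in arc-seconds
fine = 30                        # 30" resolution
ratio = cells_per_coarse_cell(coarse, fine)
```

Under these assumptions, a 2.5' cell is roughly 4.6 km on a side at the equator and a 30" cell is roughly 0.9 km, so each 2.5' cell covers 25 cells of the 30" grid.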

jb388 commented 3 years ago

The new version of ISRaD, v1.7.8 (which is also on CRAN as of 5 January, 2021), now serves Bioclim v2.1 climate data (variables BIO1-BIO19 in the profile table). The old v1 Bioclim variables are no longer served. In the interest of preventing confusion, the new variables have different names than the old ones.

TerraClim soil moisture data has also been added.

See the ISRaD_Extra_Info.xlsx file for more information on the variables including temporal coverage and spatial resolution. This file has also been updated to provide a bit more information about some variables that we had been serving but were not described at all.