Add water variables

sigmafelix commented 6 months ago

Add air quality/hydrography data support to the download/process/calculate suites. Since download functions already exist in USGS's nhdplusTools and dataRetrieval packages, I think priority should go to the other two (process and calculate). From the PrestoGP covariate list, the water variables are:

[ ] NAWQA Pesticide (county level) estimates

@kyle-messier One question on direction: do we want to calculate station data by nearest-neighbor spatial join or SEDC, or summarize station data at HUCs and then do a spatial join afterwards?
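To make the first two options concrete, here is a minimal sketch contrasting a nearest-neighbor lookup with a sum of exponentially decaying contributions (SEDC). It assumes planar (projected) coordinates and the exponential kernel `exp(-3 * d / r)`, which is one common form; the function names, station layout, and `decay_range` value are hypothetical and are not taken from amadeus or chopin.

```python
import math


def nearest_neighbor(target, stations):
    """Return the value of the station closest to the target point."""
    closest = min(stations, key=lambda s: math.dist(target, s["xy"]))
    return closest["value"]


def sedc(target, stations, decay_range):
    """Sum station values weighted by exp(-3 * distance / decay_range).

    The kernel form is an assumption for illustration only.
    """
    return sum(
        s["value"] * math.exp(-3.0 * math.dist(target, s["xy"]) / decay_range)
        for s in stations
    )


# Hypothetical stations in projected coordinates (same units as decay_range)
stations = [
    {"xy": (0.0, 0.0), "value": 10.0},
    {"xy": (3.0, 4.0), "value": 20.0},
]
target = (0.0, 1.0)

nn_value = nearest_neighbor(target, stations)
sedc_value = sedc(target, stations, decay_range=5.0)
```

The nearest-neighbor option keeps only one station per point, while SEDC lets every station contribute with a weight that shrinks with distance.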

Other variables that are not water-related are given low priority:

[ ] Soil chemistry (point-based), USGS
[x] OpenLandMap
[ ] Geology
[x] TerraClimate
[x] Cropscape (aka CDL)
[x] PRISM

kyle-messier commented 6 months ago

@sigmafelix @mitchellmanware

I think NAWQA Pesticide is too niche, and we should not include it.
The soil chemistry data are too irregular for each parameter, so we should probably exclude them too.

These are raster-based, so we should consider them; they should be a straightforward extension of all of your raster downloading and processing functions.

[x] OLM has the landmap package, but it is not maintained. It calls an API, which may work, but your approaches could be a nice addition.
[x] TerraClimate provides an R script for accessing THREDDS. I don't think y'all have used this approach yet. Let's chat about whether that would make sense, or whether to keep downloading yearly files, etc. I think taking the code they suggested for accessing the NetCDF via THREDDS could be useful, plus y'all have developed built-in checks to deal with server-side issues that the average user doesn't know how to deal with.
[x] CropScape also has the CropScapeR package for downloading that. I'm not entirely sure of its utility.
[x] PRISM appears to have its own, nice and maintained package for data access.

So these rasters clearly have data download options with varying utility. We can discuss utility vs effort for replicating those efforts and consolidating into 1 package.

To answer your main question, I think the key functionality we should add for water quality is the HUC-based buffer: (1) finding the HUC12/10/8 for a point, and (2) calculating summary statistics of rasters in that HUC. I don't see any SEDC variable that can be calculated.
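As a rough illustration of how the HUC levels relate, the sketch below relies on the nesting of HUC codes: the first 10 and 8 digits of a HUC12 code give its HUC10 and HUC8. It assumes the spatial step (assigning a HUC12 to each point or raster cell) has already been done elsewhere. The function names and the HUC codes are made up for illustration and are not part of amadeus or nhdplusTools.

```python
from collections import defaultdict
from statistics import mean


def huc_parents(huc12):
    """Derive HUC10 and HUC8 codes by truncating a 12-digit HUC code."""
    if len(huc12) != 12 or not huc12.isdigit():
        raise ValueError(f"expected a 12-digit HUC code, got {huc12!r}")
    return {"huc12": huc12, "huc10": huc12[:10], "huc8": huc12[:8]}


def summarize_by_huc(cells, level="huc8"):
    """Average cell values grouped by the HUC code at the requested level."""
    groups = defaultdict(list)
    for huc12, value in cells:
        groups[huc_parents(huc12)[level]].append(value)
    return {code: mean(values) for code, values in groups.items()}


# Hypothetical raster cells already tagged with an (illustrative) HUC12 code
cells = [
    ("030300020101", 1.0),
    ("030300020102", 3.0),
    ("030300030101", 5.0),
]
summary = summarize_by_huc(cells, level="huc8")
```

Keeping codes as strings preserves leading zeros, which matters because a code such as `03030002` would otherwise lose its first digit.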

mitchellmanware commented 6 months ago

There seem to be a couple of R packages that have TerraClimate functions (QBMS, datazoom.amazonia, climateR), but QBMS::get_terraclimate returns values extracted at points, datazoom.amazonia::load_climate returns values as a tibble, and climateR::getTerraClim requires cropping/extraction of the raw data.

None of the packages offers URL-to-machine download functions, which would be a useful contribution. TerraClimate is from the same group that produces gridMET, so the common catalog makes it easy to add these sources to amadeus.

sigmafelix commented 6 months ago

As our list of supported datasets/covariates expands, I do think that we should modularize the calculation functions, as we did for the download and test functions. We could classify the common patterns found in the calculation functions, then separate these parts into independent functions. That will make maintenance significantly easier. Of course, for some datasets we will not be able to attain the same level of abstraction. Some chopin functions could be adopted or referenced in amadeus functions, as in the calc_sedc case.
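One way to picture this kind of modularization is a small registry in which dataset-specific functions reuse shared extract and summarize steps. This is only a Python stand-in for the idea, not the amadeus R implementation; apart from the name calc_covariates, every name here is hypothetical.

```python
def extract_at_locations(grid, locations):
    """Shared step: read the grid value at each (row, col) location."""
    return [grid[row][col] for row, col in locations]


def summarize(values, fun):
    """Shared step: reduce the extracted values with a summary function."""
    return fun(values)


# Registry mapping a dataset name to its calculation function
CALCULATORS = {}


def register(name):
    """Decorator that adds a calculation function to the registry."""
    def wrap(func):
        CALCULATORS[name] = func
        return func
    return wrap


@register("example_mean")
def calc_example_mean(grid, locations):
    values = extract_at_locations(grid, locations)
    return summarize(values, lambda v: sum(v) / len(v))


@register("example_max")
def calc_example_max(grid, locations):
    return summarize(extract_at_locations(grid, locations), max)


def calc_covariates(name, grid, locations):
    """Dispatch to the registered calculation function for a dataset."""
    return CALCULATORS[name](grid, locations)


grid = [[1.0, 2.0], [3.0, 4.0]]
locations = [(0, 0), (1, 1)]
mean_value = calc_covariates("example_mean", grid, locations)
max_value = calc_covariates("example_max", grid, locations)
```

Datasets that do not fit the shared steps can still register a fully custom function, which matches the caveat that not every dataset reaches the same level of abstraction.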

mitchellmanware commented 6 months ago

Agreed. Many of the dataset specific nuances are cleaned up in the processing stage, which makes the covariate functions much more reproducible.

How do you want to divide workload between new functions and modularizing the existing calc_ functions? I have no preference for either @sigmafelix

sigmafelix commented 6 months ago

@mitchellmanware I will draft CropScape and PRISM process/calculate functions. HUC 8 and 12 are available via nhdplusTools. Perhaps a HUC lookup function is necessary for users who want to download a small subset of HUCs. I think most users will want to use a subset rather than the entire national dataset. FYI, national HUC boundary vectors are pretty large (WBD: 2.5GB compressed; NHDPlus: 28.1GB compressed).

mitchellmanware commented 6 months ago

I will work on modularizing what we have in calc_ and creating the suite of functions for the TerraClimate and gridMET data sources first, and then move on to the HUC tools, as I need to read more about what is available.

mitchellmanware commented 5 months ago

gridMET and TerraClimate functions added; moving on to modularization.

mitchellmanware commented 5 months ago

New gridMET and TerraClimate functions, as well as calc_* modularization, have been implemented on branch mm-terraclimate-0325. Functionality for OpenLandMap data will take longer, so I will start on a new branch.

@sigmafelix @kyle-messier Based on beethoven's new README.md and the Teams post about "feature engineering", should we consider renaming functions from calc_*() to features_*() and calc_covariates() to create_features(), or something similar? I foresee us using "features" and "feature engineering" more than "covariate calculation/creation" in the manuscript, so consistency across both would be good.

NIEHS / amadeus

Add water variables #45