Closed sigmafelix closed 3 days ago
@sigmafelix @mitchellmanware
These are raster based, so we should consider them. Adding them should be a straightforward extension of all of your raster downloading and processing functions.
These rasters clearly already have data download options with varying utility. We can discuss utility versus effort for replicating those efforts and consolidating them into one package.
To answer your main question, I think the key functionality we should add for water quality is the HUC-based buffer: (1) finding the HUC12/10/8 for a point, and (2) calculating summary statistics of rasters within that HUC. I don't see any SEDC variable that can be calculated.
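The two steps above can be sketched roughly as follows. This is a minimal, standard-library Python illustration and not the amadeus implementation: the HUC codes, the square polygons, the toy raster, and all function names are hypothetical. It assumes each raster cell is assigned to a HUC by its cell center, and it relies on HUC codes being nested, so the first 8 and 10 digits of a HUC12 code give its HUC8 and HUC10.

```python
from statistics import mean


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (not closed)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_huc12(x, y, hucs):
    """Step 1: return the code of the first HUC12 polygon containing the point."""
    for code, ring in hucs.items():
        if point_in_polygon(x, y, ring):
            return code
    return None


def huc_levels(huc12):
    """Derive the coarser units from the nested HUC12 code."""
    return {"huc8": huc12[:8], "huc10": huc12[:10], "huc12": huc12}


def summarize_raster_in_huc(raster, ring):
    """Step 2: summary statistics of cells whose centers fall inside the HUC.

    Assumption: cell (row r, column c) has its center at (c + 0.5, r + 0.5).
    """
    values = [
        value
        for r, row in enumerate(raster)
        for c, value in enumerate(row)
        if point_in_polygon(c + 0.5, r + 0.5, ring)
    ]
    if not values:
        return {"n": 0, "mean": None, "min": None, "max": None}
    return {"n": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


# Hypothetical HUC12 polygons (two adjacent squares) and a 2 x 4 toy raster.
hucs = {
    "030300020101": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "030300020102": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
raster = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

code = find_huc12(1.0, 1.0, hucs)
levels = huc_levels(code)
stats = summarize_raster_in_huc(raster, hucs[code])
```

In practice the polygons would come from WBD or NHDPlus boundaries and the extraction from a raster library; the sketch only shows the lookup-then-summarize order of operations.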
There seem to be a couple of R packages that have TerraClimate functions (QBMS, datazoom.amazonia, climateR), but QBMS::get_terraclimate returns values extracted at points, datazoom.amazonia::load_climate returns values as a tibble, and climateR::getTerraClim requires cropping/extraction of the raw data.
None of the packages offers URL-to-machine download functions, which would be a useful contribution. TerraClimate is from the same group that produces gridMET, so the common catalog makes it easy to add these sources to amadeus.
As our list of supported datasets/covariates expands, I do think that we should modularize the calculation functions as we did for the download and test functions. We could classify several common patterns found in the calculation functions, then separate these parts into independent functions. It will make maintenance significantly easier. Of course, for some datasets we will not be able to attain the same level of abstraction. Some chopin functions could be adopted or referenced in amadeus functions, as in the calc_sedc case.
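One way to picture the modularization described above is a set of small, independent steps composed by a thin wrapper. The sketch below is a hypothetical Python illustration only: the three-step split (extract, summarize, label), the function names, and the column naming scheme are assumptions for the example, not the structure of the existing calc_ functions.

```python
from statistics import mean


def extract_values(locations, source, radius):
    """Step 1: collect source values within `radius` of each location.

    locations: {site_id: (x, y)}; source: list of (x, y, value) tuples.
    """
    out = {}
    for site_id, (x, y) in locations.items():
        out[site_id] = [
            value
            for (sx, sy, value) in source
            if (sx - x) ** 2 + (sy - y) ** 2 <= radius ** 2
        ]
    return out


def summarize(values_by_site, fun):
    """Step 2: apply a summary function; sites with no values get None."""
    return {
        site: (fun(vals) if vals else None)
        for site, vals in values_by_site.items()
    }


def label(summary, variable, radius):
    """Step 3: attach a column name built from variable and radius."""
    column = f"{variable}_{radius:05d}"
    return [{"site_id": site, column: value}
            for site, value in sorted(summary.items())]


def calc_generic(locations, source, variable, radius, fun=mean):
    """Thin wrapper: a dataset-specific function would only swap steps."""
    return label(summarize(extract_values(locations, source, radius), fun),
                 variable, radius)


# Hypothetical sites and source values.
locations = {"A": (0, 0), "B": (10, 10)}
source = [(0, 1, 2.0), (1, 0, 4.0), (10, 11, 6.0)]
result = calc_generic(locations, source, "tmax", 2)
```

With this layout, a dataset that needs special handling replaces one step (for example the extraction) while reusing the others, which is where the maintenance benefit would come from.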
Agreed. Many of the dataset-specific nuances are cleaned up in the processing stage, which makes the covariate functions much more reproducible.
How do you want to divide the workload between new functions and modularizing the existing calc_ functions? I have no preference for either, @sigmafelix.
@mitchellmanware I will draft CropScape and PRISM process/calculate functions. HUC 8 and 12 are available via nhdplusTools. Perhaps a HUC lookup function is necessary for users who want to download a small subset of HUCs. I think most users will want to use a subset rather than the entire national dataset. FYI, national HUC boundary vectors are pretty large (WBD: 2.5GB compressed; NHDPlus: 28.1GB compressed).
I will work on modularizing what we have in calc_ and creating the suite of functions for the TerraClimate and gridMET data sources first, and then move on to the HUC tools, as I need to read more about what is available.
gridMET and TerraClimate functions added; moving on to modularization.
New gridMET and TerraClimate functions, as well as calc_* modularization, have been implemented on branch mm-terraclimate-0325. Functionality for OpenLandMap data will take longer, so I will start it on a new branch.
@sigmafelix @kyle-messier
Based on beethoven's new README.md and the Teams post about "feature engineering", should we consider updating function names from calc_*() to features_*() and calc_covariates() to create_features(), or something similar? I foresee us using "features" and "feature engineering" more than "covariate calculation/creation" in the manuscript, so consistency across both would be good.
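If the rename goes ahead, one common way to avoid breaking existing code is to keep the old name as a thin alias that warns and forwards to the new one. The sketch below shows that pattern in Python for illustration only; the helper `deprecated_alias` and the toy body of `create_features` are hypothetical, and nothing here reflects a decision made in this thread.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns under the old name, then calls new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name}() is deprecated; use {new_func.__name__}() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def create_features(values):
    """Toy stand-in for the renamed function (hypothetical body)."""
    return {"mean": sum(values) / len(values)}


# The old name keeps working but points users to the new one.
calc_covariates = deprecated_alias(create_features, "calc_covariates")
```

Calling `calc_covariates([1, 2, 3])` returns the same result as `create_features([1, 2, 3])` and emits a `DeprecationWarning`.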
@sigmafelix @mitchellmanware I think this is stale, so I am closing this. We can open other issues with the proper new dataset tag if needed later.
Air quality/hydrography data support in download/process/calculate suites. As download functions are already in USGS's nhdplusTools and dataRetrieval packages, I think priority should be given to the other two (process and calculate). From the PrestoGP covariate list, water variables are:

@kyle-messier One question on the direction is whether we want to calculate station data by nearest-neighbor spatial join or SEDC, or summarize station data at HUCs and then do the spatial join afterwards.
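To make the first two options in that question concrete, here is a minimal Python sketch comparing a nearest-neighbor join with SEDC at a single target point. It is illustrative only: the station coordinates and values are made up, distances are planar, and the SEDC form used here, a sum of station values weighted by exp(-3 * distance / range), is an assumption about the weighting rather than a statement of how calc_sedc is implemented.

```python
import math


def nearest_neighbor_value(target, stations):
    """Return the value of the station closest to the target point."""
    tx, ty = target
    nearest = min(stations, key=lambda s: math.hypot(s[0] - tx, s[1] - ty))
    return nearest[2]


def sedc(target, stations, decay_range):
    """Sum of exponentially decaying contributions (assumed form).

    Each station contributes value * exp(-3 * distance / decay_range),
    so every station matters, with weight falling off with distance.
    """
    tx, ty = target
    total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(sx - tx, sy - ty)
        total += value * math.exp(-3.0 * distance / decay_range)
    return total


# Hypothetical stations as (x, y, value) and one target location.
stations = [(0.0, 0.0, 10.0), (3.0, 4.0, 20.0)]
target = (0.0, 0.0)

nn_value = nearest_neighbor_value(target, stations)
sedc_value = sedc(target, stations, decay_range=5.0)
```

The nearest-neighbor join keeps only the closest station's value, while SEDC blends all stations. The third option, summarizing at HUCs first, would instead aggregate stations by HUC membership before joining to the points.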
Other variables that are not water-related are given low priority: