NIEHS / amadeus

https://niehs.github.io/amadeus/

Review `process_*` functions for time data and layer name #68

Closed mitchellmanware closed 3 months ago

mitchellmanware commented 5 months ago

Review "static" dataset functions (i.e., GMTED, population, groads, NLCD) to ensure time information is returned. All objects should have a time orientation, even if the underlying data are not frequently updated. This may require hard-coding the year.

sigmafelix commented 5 months ago

@mitchellmanware In process_nlcd, I use terra::metags to record the year. Since the GMTED and SEDAC data were each produced in a single year, the year could be added in a similar way, with the value hard-coded. If we expect future updates to these datasets, it would be good to add a year argument to the process_* functions.

https://github.com/NIEHS/amadeus/blob/037412812aa82de7443b5dde7bd3600308310807/R/process.R#L680

mitchellmanware commented 5 months ago

@sigmafelix While reviewing the process_ and calc_ functions, I noticed that some of the calc_ functions you created accept only SpatVector or sf objects as locations.

For example, from calc_ecoregion:

calc_ecoregion <-
  function(
    from = NULL,
    locs,
    locs_id = "site_id",
    ...
  ) {

    if (!methods::is(locs, "SpatVector")) {
      locs <- terra::vect(locs)
    }

Is there a reason you do not use the process_conformity function to accept SpatVector, sf, and data.frame alike?

Could be:

calc_ecoregion <-
  function(
    from = NULL,
    locs,
    locs_id = "site_id",
    ...
  ) {

    if (!methods::is(locs, "SpatVector")) {
      locs <- process_conformity(locs = locs)
    }

to accept all three classes.

mitchellmanware commented 5 months ago

See commit 062f448623296627bf3c6b0b4b96f015ac83c8ea.

A year/range metadata tag has been added to the GMTED, groads, population, and Koppen Geiger process_* functions, and a $time column to their calc_ functions. For GMTED and SEDAC population, a single year is returned (always 2010 for GMTED; for population, the user-selected year).

For the SEDAC groads, Koppen Geiger, and ecoregions functions, I have added the year range covered, as indicated by each dataset's description. For example, the SEDAC groads data cover the period 1980 to 2010, so that range is added as a metadata tag and covariate column.

> ### sedac groads
> g <- process_sedac_groads(
+   path = "tests/testdata/groads_test.shp"
+ )
> calc_sedac_groads(
+   g,
+   l,
+   "id"
+ )
                id        time GRD_TOTAL_0_01000 GRD_DENKM_0_01000
1 3799900018810101 1980 - 2010          1.762476         0.5633273

> ### koppen geiger
> k <- process_koppen_geiger(
+   path = "tests/testdata/koppen_subset.tif"
+ )
> terra::metags(k)
         year 
"1980 - 2016" 
> calc_koppen_geiger(
+   k,
+   l,
+   "id"
+ )
                id        time DUM_CLRGA_0_00000 DUM_CLRGB_0_00000 DUM_CLRGC_0_00000 DUM_CLRGD_0_00000 DUM_CLRGE_0_00000
1 3799900018810101 1980 - 2016                 0                 0                 1                 0                 0

> ### ecoregions
> e <- process_ecoregion(
+   path = "tests/testdata/eco_l3_clip.gpkg"
+ )
> site_faux <-
+   data.frame(
+     site_id = "37999109988101",
+     lon = -77.576,
+     lat = 39.40,
+     date = as.Date("2022-01-01")
+   )
> site_faux <-
+   terra::vect(
+     site_faux,
+     geom = c("lon", "lat"),
+     keepgeom = TRUE,
+     crs = "EPSG:4326")
> site_faux <- terra::project(site_faux, "EPSG:5070")
> calc_ecoregion(
+   e,
+   site_faux,
+   "site_id"
+ )
         site_id        time DUM_E2083_0_00000 DUM_E3064_0_00000
1 37999109988101 1997 - 2024                 1                 1
> 

Although these ranges do not conform to the usual values in the $time column, they are at least consistent with the original datasets.

HUC and OpenLandMap are the only datasets that do not include some sort of time information.

sigmafelix commented 5 months ago

@mitchellmanware I think the time field is supposed to work as one of the keys. In the demonstration above, the time field looks more like a description of the period the source dataset represents. An advantage of using the time field as a key is that users can join multiple calc_* results on common keys. Could we move the source data description into a separate field, named, for example, description?
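
To illustrate the join described above, here is a minimal base R sketch. The three data frames are hypothetical stand-ins for calc_* results (the column names elevation, population, and road_density and all values are made up for illustration, not actual amadeus output). It assumes that id and time together act as a composite key, and that a dataset described only by a year range carries that range in a description column instead.

```r
# Hypothetical calc_* style outputs; values are invented for illustration.
elev <- data.frame(
  id = c("A", "B"),
  time = c(2010L, 2010L),
  elevation = c(120.5, 98.2)
)
pop <- data.frame(
  id = c("A", "B"),
  time = c(2010L, 2010L),
  population = c(1500, 320)
)
# A year range is kept as a description, not as a join key.
roads <- data.frame(
  id = c("A", "B"),
  description = "1980 - 2010",
  road_density = c(0.56, 0.12)
)

# id and time together identify a row, so both are used as keys.
covars <- merge(elev, pop, by = c("id", "time"))

# The range-described dataset has no usable time key; join on id only.
covars <- merge(covars, roads, by = "id")
print(covars)
```

If the range string stayed in the time column, the first style of join would silently drop every row, because "1980 - 2010" never equals a single year such as 2010.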

mitchellmanware commented 5 months ago

@sigmafelix Yes, that makes sense. I will update.

mitchellmanware commented 5 months ago

Update

> e <- process_ecoregion(
+   path = "tests/testdata/eco_l3_clip.gpkg"
+ )
> site_faux <-
+    data.frame(
+      id = "1",
+      lon = -77.576,
+      lat = 39.40,
+      date = as.Date("2022-01-01")
+ )
> site_faux <- terra::vect(site_faux, crs = "EPSG:4326")
> site_proj <- terra::project(site_faux, terra::crs(e))
> calc_ecoregion(
+   e,
+   site_proj,
+   "id"
+ )
  id description DUM_E2083_0_00000 DUM_E3064_0_00000
1  1 1997 - 2024                 1                 1