NIEHS / amadeus

https://niehs.github.io/amadeus/
MIT License

Download, process, and covariate calculation package - next steps #5

Closed mitchellmanware closed 3 weeks ago

mitchellmanware commented 5 months ago

Updates, additions, and next steps for the suite of functions designed to download, process, and calculate covariates from geospatial environmental data sources. @sigmafelix @Spatiotemporal-Exposures-and-Toxicology

mitchellmanware commented 5 months ago

Download Functions

Processing/Covariate Calculation Functions

Discussion

  1. Should new data sources be included in first version of package, or should we focus on developing a full suite of processing/calculation functions around the data download functions we already have?
  2. Structure and names for the processing and covariate calculation functions.
kyle-messier commented 5 months ago

Thanks Mitchell

  1. Yes to code naming consistency
  2. It looks like you are considering adding other covariates - that is a good list. If it is straightforward to do so, then go ahead. But that can be a future release too
  3. We still need package names!
mitchellmanware commented 5 months ago

Whether the additional data sources are included in the first or a later version, I think our immediate priority is cleaning up what we have. It will be easier to add additional download, process, and calculation functions once we have standard function names and more robust unit/integration tests.

mitchellmanware commented 5 months ago

The code naming consistency can be based on the output of each function.

The download... functions return raw, unprocessed downloaded data from various sources.

The process... functions clean these raw data (e.g., variable selection from netCDF, spatial subsetting) and return a single SpatRaster/SpatVector object. The user would need to use only two functions to get cleaned, interpretable geospatial data into the R environment.

The calc... functions can be renamed to covariate... because extracting data at points or polygons is too limited in scope for the name "calc". New calc... functions can then be created to calculate spatio-temporal summary statistics (e.g., average temperature within counties/census tracts or a user-defined set of polygons).
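A spatio-temporal summary function along these lines could be a thin wrapper over `terra::extract()`. This is a minimal sketch, not the package's actual API: the function name `calc_zonal_mean` is hypothetical, and the raster/polygon below are toy inputs.

```r
library(terra)

# Hypothetical sketch of a zonal-summary "calc" function: one summary
# value per polygon (county, census tract, or user-defined zone).
calc_zonal_mean <- function(rast_obj, polygons) {
  # terra::extract with a summary function returns one row per polygon
  terra::extract(rast_obj, polygons, fun = mean, na.rm = TRUE)
}

# Toy example: a 10 x 10 raster with constant value 2 and one square polygon
r <- terra::rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10,
                 ymin = 0, ymax = 10, vals = 2)
names(r) <- "temperature"
p <- terra::vect("POLYGON ((1 1, 1 4, 4 4, 4 1, 1 1))", crs = "EPSG:4326")
calc_zonal_mean(r, p)
```

Swapping `fun = mean` for other summary functions (or vectorizing over a set of statistics) would cover the "spatio-temporal summary statistics" case described above.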

mitchellmanware commented 5 months ago

Mitchell To Do (January 17, 2023) -

mitchellmanware commented 5 months ago

process...

Parameters

Unit Tests

sigmafelix commented 5 months ago

calc_, process_, or covariate_ functions may need to include an extent/tile range argument for flexibility. The functions in this package, at least calc_modis and its dependencies, are fairly generic: they accept file paths and select one day's file paths to process.

We mainly use circular buffer polygons, which are generated inside each calc_ function, to extract values from raster datasets. Generic polygon inputs need to be supported for general summary cases (jurisdictions, isochrones, etc.). One option is to keep circular buffers as the default.
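One way to support both behaviors is to branch on the geometry type of the input: buffer point inputs by default, and pass user-supplied polygons through unchanged. A sketch under those assumptions (the function name and `radius` argument are illustrative, not the package's API):

```r
library(terra)

# Sketch: default to circular buffers for point inputs, but accept
# arbitrary polygons (jurisdictions, isochrones, ...) unchanged.
prepare_extraction_zones <- function(locs, radius = 1000) {
  if (terra::geomtype(locs) == "points") {
    # default case: circular buffers of `radius` meters around each site
    return(terra::buffer(locs, width = radius))
  }
  # general case: user-supplied polygons are used as-is
  locs
}

sites <- terra::vect(cbind(0, 0), type = "points", crs = "EPSG:4326")
zones <- prepare_extraction_zones(sites, radius = 1000)
```

The downstream calc_ function then only ever sees polygons, so the extraction logic stays identical for both cases.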

mitchellmanware commented 5 months ago

Discussion 2024/01/18 @sigmafelix @Spatiotemporal-Exposures-and-Toxicology

sigmafelix commented 5 months ago

For Point 1, MODIS-related auxiliary functions are separated from the main functions and placed in R/calculate_covariates_support.R. The file and function names will be changed in the next PR. I guess `covariate_` is supposed to be `calculate_`?

For Point 4 (helper function), I think a spatial data class conversion function will be necessary. Functions written earlier (mostly by me) usually accept sf or SpatVector in the sites argument, whereas later versions accept stdt.

For Point 5, I think unifying common argument names across process_ functions needs to be prioritized. Before that, we need to think about Point 6 and the format of covariate source datasets. What would be the proper form: an sf/stars/Spat* object, or file path(s)? The intermediate product from process_ could be vector or raster, so we might classify the processing functions into two groups and design uniform arguments in the calculate_ functions accordingly.

mitchellmanware commented 5 months ago

@sigmafelix I agree with all your points. For Points 5/6, my thought was that the process_ functions would return a cleaned Spat* object into the user's R environment. By "cleaned" I mean

The "cleaning" does not have to include manipulation/calculation/alteration of the data values in any way (ignore the contradiction with daily = in process_geos 😅), which would make the functions (1) simpler and (2) reliant on fewer arguments. The essential arguments that come to mind are temporal range (start and end), variable, and the directory with the data. Other source-specific items (e.g., collection) can be verified by checking for the presence of the variable within the data's varnames rather than being a separate argument.
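One concrete piece such a process_ function needs is selecting the files in the data directory whose dates fall in the requested temporal range. A base-R sketch, assuming GEOS-style filenames with an embedded YYYYMMDD stamp; the helper name is hypothetical, not the package's actual code:

```r
# Hypothetical helper: keep only files whose embedded YYYYMMDD stamp
# falls within [date_start, date_end]. Assumes filenames like
# "GEOS-CF.v01.rpl.aqc_tavg_1hr.20180101_0030z.nc4".
filter_files_by_date <- function(paths, date_start, date_end) {
  stamps <- regmatches(paths, regexpr("20[0-9]{6}", paths))
  dates <- as.Date(stamps, format = "%Y%m%d")
  paths[dates >= as.Date(date_start) & dates <= as.Date(date_end)]
}

files <- c(
  "GEOS-CF.v01.rpl.aqc_tavg_1hr.20180101_0030z.nc4",
  "GEOS-CF.v01.rpl.aqc_tavg_1hr.20180102_0030z.nc4"
)
filter_files_by_date(files, "2018-01-01", "2018-01-01")
```

With file selection handled this way, the only arguments the user supplies are the temporal range and the directory, matching the minimal signature described above.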

I think generating cleaned data in a two-step workflow, download_data(...) > process_(...), is useful enough to users that it saves a lot of time, but also general enough that they/we can run our own calc_/covar_ functions or proceed with other analyses in R.

Lots of rambling thoughts so I will refine and post again with an example.

mitchellmanware commented 5 months ago

import_geos and import_narr have been simplified to use only the date_start, date_end, variable, and directory_with_data parameters. The functions import and clean data, providing the user with a single SpatRaster object for their desired temporal range and variable. The data values are not changed in any way; only the CRS, time, and variable names are cleaned, retaining information such as pressure levels in the variable names.

https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/mm-process-functions/R/import.R

> d <- "2018-01-01"
> dwd <- "../testdata/"
> g <- import_geos(
+   date_start = d,
+   date_end = d,
+   variable = "O3",
+   paste0(dwd, "geos/aqc_tavg_1hr_g1440x721_v1"))
Cleaning O3 data for 2018-01-01 00:30:00...
Cleaning O3 data for 2018-01-01 01:30:00...
Cleaning O3 data for 2018-01-01 02:30:00...
Cleaning O3 data for 2018-01-01 03:30:00...
Cleaning O3 data for 2018-01-01 04:30:00...
Cleaning O3 data for 2018-01-01 05:30:00...
Cleaning O3 data for 2018-01-01 06:30:00...
Cleaning O3 data for 2018-01-01 07:30:00...
Cleaning O3 data for 2018-01-01 08:30:00...
Cleaning O3 data for 2018-01-01 09:30:00...
Cleaning O3 data for 2018-01-01 10:30:00...
Cleaning O3 data for 2018-01-01 11:30:00...
Cleaning O3 data for 2018-01-01 12:30:00...
Cleaning O3 data for 2018-01-01 13:30:00...
Cleaning O3 data for 2018-01-01 14:30:00...
Cleaning O3 data for 2018-01-01 15:30:00...
Cleaning O3 data for 2018-01-01 16:30:00...
Cleaning O3 data for 2018-01-01 17:30:00...
Cleaning O3 data for 2018-01-01 18:30:00...
Cleaning O3 data for 2018-01-01 19:30:00...
Cleaning O3 data for 2018-01-01 20:30:00...
Cleaning O3 data for 2018-01-01 21:30:00...
Cleaning O3 data for 2018-01-01 22:30:00...
Cleaning O3 data for 2018-01-01 23:30:00...
Returning hourly O3 data from 2018-01-01 to 2018-01-01.
> g
class       : SpatRaster 
dimensions  : 721, 1440, 24  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180.125, 179.875, -90.125, 90.125  (xmin, xmax, ymin, ymax)
coord. ref. :  
sources     : GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:O3  
              GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0130z.nc4:O3  
              GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0230z.nc4:O3  
              ... and 21 more source(s)
varnames    : O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              ...
names       : O3_le~03000, O3_le~13000, O3_le~23000, O3_le~33000, O3_le~43000, O3_le~53000, ... 
unit        :   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1, ... 
time        : 2018-01-01 00:30:00 to 2018-01-01 23:30:00 UTC 
> names(g)[1:10]
 [1] "O3_lev=72_20180101_003000" "O3_lev=72_20180101_013000"
 [3] "O3_lev=72_20180101_023000" "O3_lev=72_20180101_033000"
 [5] "O3_lev=72_20180101_043000" "O3_lev=72_20180101_053000"
 [7] "O3_lev=72_20180101_063000" "O3_lev=72_20180101_073000"
 [9] "O3_lev=72_20180101_083000" "O3_lev=72_20180101_093000"
> time(g)[1:10]
 [1] "2018-01-01 00:30:00 UTC" "2018-01-01 01:30:00 UTC"
 [3] "2018-01-01 02:30:00 UTC" "2018-01-01 03:30:00 UTC"
 [5] "2018-01-01 04:30:00 UTC" "2018-01-01 05:30:00 UTC"
 [7] "2018-01-01 06:30:00 UTC" "2018-01-01 07:30:00 UTC"
 [9] "2018-01-01 08:30:00 UTC" "2018-01-01 09:30:00 UTC"
> n <- import_narr(
+   date_start = d,
+   date_end = d,
+   variable = "omega",
+   paste0(dwd, "narr/omega/"))
Cleaning data for year 2018...
Returning daily omega data from 20180101 to 20180101.
> n
class       : SpatRaster 
dimensions  : 277, 349, 29  (nrow, ncol, nlyr)
resolution  : 32462.99, 32463  (x, y)
extent      : -16231.49, 11313351, -16231.5, 8976020  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=lcc +lat_0=50 +lon_0=-107 +lat_1=50 +lat_2=50 +x_0=5632642.22547 +y_0=4612545.65137 +datum=WGS84 +units=m +no_defs 
source      : omega.201801.nc:omega 
varname     : omega (Daily Omega on Pressure Levels) 
names       : omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, ... 
unit        :    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s, ... 
time        : 2018-01-01 UTC 
> names(n)[1:10]
 [1] "omega_level=1000_20180101" "omega_level=975_20180101" 
 [3] "omega_level=950_20180101"  "omega_level=925_20180101" 
 [5] "omega_level=900_20180101"  "omega_level=875_20180101" 
 [7] "omega_level=850_20180101"  "omega_level=825_20180101" 
 [9] "omega_level=800_20180101"  "omega_level=775_20180101" 
> time(n)[1:10]
 [1] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
 [5] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
 [9] "2018-01-01 UTC" "2018-01-01 UTC"
> 

@sigmafelix Is this more along the lines of "common argument names"?

mitchellmanware commented 5 months ago

Performance checks for spatial subsetting with import_* functions

import_geos is the function as it currently exists on the mm-process-functions branch; spatial subsetting occurs external to the function after the data has been cleaned. import_geos_crop_final crops the final single SpatRaster after all data has been cleaned. import_geos_crop_layer crops each layer of the data individually after cleaning but before merging into the final SpatRaster.

Test with collection aqc_tavg_1hr_g1440x721_v1 (single level):

> microbenchmark(
  crop_external = import_geos(
    variable = "O3",
    directory_with_data = d
  ) %>>% terra::crop(
    project(
      ct,
      "EPSG:4326"
    )
  ),
  crop_final = import_geos_crop_final(
    variable = "O3",
    directory_with_data = d,
    crop = ct
  ),
  crop_layer = import_geos_crop_layer(
    variable = "O3",
    directory_with_data = d,
    crop = ct
  ),
  times = 5
)
Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
 crop_external 2.469672 2.484679 2.492802 2.485250 2.497332 2.527078     5   a
    crop_final 2.472164 2.472217 2.488969 2.473530 2.498845 2.528091     5   a
    crop_layer 2.636986 2.655052 2.912806 2.661639 2.692731 3.917623     5   a

Test with collection chm_inst_1hr_g1440x721_p23 (pressure levels):

> microbenchmark(
+   crop_external = import_geos(
+     variable = "O3",
+     directory_with_data = d
+   ) %>>% terra::crop(
+     project(
+       ct,
+       "EPSG:4326"
+     )
+   ),
+   crop_final = import_geos_crop_final(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   crop_layer = import_geos_crop_layer(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   times = 5
+ )
Unit: seconds
          expr       min       lq     mean   median       uq      max neval cld
 crop_external  9.940528 10.03314 10.17560 10.19378 10.19413 10.51641     5   a
    crop_final 10.039182 10.10433 10.82902 10.12391 10.52642 13.35124     5   a
    crop_layer 10.357362 10.50860 10.56581 10.55911 10.62703 10.77697     5   a

At least for GEOS-CF data, performance between these functions was similar, so we can discuss including crop = or subset = as a standard parameter in the import_* functions.

Local: When run locally, spatial subsetting after layer-specific cleaning is faster by 10 to 15 seconds.

> microbenchmark(
+   crop_external = import_geos(
+     variable = "O3",
+     directory_with_data = d
+   ) %>>% terra::crop(
+     project(
+       ct,
+       "EPSG:4326"
+     )
+   ),
+   crop_final = import_geos_crop_final(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   crop_layer = import_geos_crop_layer(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   times = 2
+ )
Unit: seconds
          expr      min       lq     mean   median       uq      max neval
 crop_external 464.0877 464.0877 464.4559 464.4559 464.8242 464.8242     2
    crop_final 458.4853 458.4853 460.2615 460.2615 462.0377 462.0377     2
    crop_layer 446.7328 446.7328 449.1123 449.1123 451.4919 451.4919     2
>
sigmafelix commented 4 months ago

process_* functions will perform operations very similar to modis_*. I will work on separating the processing parts out of calc_* first, then on unifying and minimizing the arguments of the process_* functions.

TODO

cf. Argument lists by functions in R/calculate_covariates.R and R/calculate_covariates_support.R

# calc_* function arguments 
                    fun path sites id_col product name_covariates radius subdataset fun_summary nthreads package_list_add
1        calc_ecoregion TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
2    calc_koppen_geiger TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
3            calc_modis TRUE  TRUE   TRUE    TRUE            TRUE   TRUE       TRUE        TRUE     TRUE             TRUE
4              calc_nei TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
5       calc_nlcd_ratio TRUE  TRUE     NA      NA              NA   TRUE         NA          NA       NA               NA
6 calc_temporal_dummies   NA  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
7              calc_tri TRUE  TRUE   TRUE      NA              NA   TRUE         NA          NA       NA               NA
  export_list_add year county_shp sites_epsg domain_year
1              NA   NA         NA         NA          NA
2              NA   NA         NA         NA          NA
3            TRUE   NA         NA         NA          NA
4              NA TRUE       TRUE       TRUE          NA
5              NA TRUE         NA         NA          NA
6              NA   NA         NA         NA        TRUE
7              NA   NA         NA       TRUE        TRUE

# modis_* function arguments 
                     fun path product nsds fun_agg paths date_in regex_sds  foo get_var resolution custom_sel subdataset crs_ref
1    modis_aggregate_sds TRUE    TRUE TRUE    TRUE    NA      NA        NA   NA      NA         NA         NA         NA      NA
2          modis_get_vrt   NA    TRUE   NA      NA  TRUE    TRUE      TRUE TRUE      NA         NA         NA         NA      NA
3     modis_mosaic_mod06   NA      NA   NA      NA  TRUE    TRUE        NA   NA    TRUE       TRUE         NA         NA      NA
4    modis_prefilter_sds   NA    TRUE   NA      NA    NA      NA        NA   NA      NA         NA       TRUE         NA      NA
5 modis_preprocess_vnp46   NA      NA   NA      NA  TRUE    TRUE        NA   NA      NA         NA         NA       TRUE    TRUE
6       modis_warp_stars TRUE      NA   NA      NA    NA      NA        NA   NA      NA         NA         NA         NA      NA
7           modis_worker   NA    TRUE   NA      NA    NA      NA        NA   NA      NA         NA         NA         NA      NA
  cellsize threshold crs_out raster date sites_in name_extracted fun_summary_raster id_col radius
1       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
2       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
3       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
4       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
5       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
6     TRUE      TRUE    TRUE     NA   NA       NA             NA                 NA     NA     NA
7       NA        NA      NA   TRUE TRUE     TRUE           TRUE               TRUE   TRUE   TRUE
sigmafelix commented 4 months ago

@mitchellmanware As per our plan to split calc_* into process_* and calculate_* and to generalize functions, we may need to rename common arguments in these functions. At least in the functions I've been working on, path and sites are usually used. I think generic names such as x and y, or loc/locs and from, which suggest "calculate covariates at x/loc/locs from from", would be preferable; sites sounds a little specific to the AQS case. I'd be happy to discuss more generic argument names.

kyle-messier commented 4 months ago

@sigmafelix It's always nice to have as much internal consistency as possible - it could potentially even be enforced with a special-use-case advanced linter (e.g. argument_name_linter)
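Short of a full lintr rule, a quick base-R audit over `formals()` can flag functions whose arguments fall outside an agreed vocabulary. A sketch: the function name, the allowed-name list, and the toy functions below are all illustrative, not part of the package.

```r
# Sketch of a lightweight argument-name audit (not a real lintr linter):
# report any function whose formals are outside the agreed vocabulary.
audit_argument_names <- function(funs, allowed) {
  flagged <- lapply(funs, function(f) setdiff(names(formals(f)), allowed))
  flagged[lengths(flagged) > 0]
}

allowed <- c("locs", "from", "path", "date_start", "date_end", "variable")
funs <- list(
  calc_good = function(locs, from) NULL,
  calc_bad  = function(sites, path) NULL
)
audit_argument_names(funs, allowed)
# flags calc_bad's "sites" argument
```

Such a check could run in the test suite over every exported function, which is cheaper to maintain than a bespoke linter while enforcing the same convention.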

mitchellmanware commented 4 months ago

I agree these general names are more broadly applicable. I think there could be more variability in the process_ function arguments as the raw data can take on many forms, but we can implement vectorized arguments for collection/variable or statistic/resolution to retain consistency.

e.g., for the GMTED processing function:

gmted <- import_gmted(
  variable = c("Breakline Emphasis", "7.5 arc seconds"),
  directory_with_data = "../testdata/gmted/gmted"
)

The variable argument takes both the statistic and the resolution while conforming to the simplified argument names.

mitchellmanware commented 4 months ago

We can also apply the path argument to the download functions. I do not recall why I created separate directory_to_download and directory_to_save parameters.

sigmafelix commented 4 months ago

@mitchellmanware The separation was for distinguishing a directory of compressed raw data from a directory of the unzipped data (what we actually use to process and calculate). directory_to_download was for the former, whereas directory_to_save was for the latter. One way to simplify the directory setting could be to force the download functions to make two directories under directory_to_save, say zips and raw. Exceptions could be handled based on the extension of the download links.
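The proposed convention could look like this sketch: a single directory_to_save argument with zips/ and raw/ subdirectories created inside the download functions. The function name is hypothetical.

```r
# Sketch of the proposed single-directory convention: the download
# functions create "zips" (compressed archives) and "raw" (unzipped
# data) under one user-supplied directory_to_save.
setup_download_dirs <- function(directory_to_save) {
  zips <- file.path(directory_to_save, "zips")
  raw  <- file.path(directory_to_save, "raw")
  dir.create(zips, recursive = TRUE, showWarnings = FALSE)
  dir.create(raw,  recursive = TRUE, showWarnings = FALSE)
  # sources with nothing to unzip (detected from the download link's
  # extension) could write straight to "raw"
  list(zips = zips, raw = raw)
}

dirs <- setup_download_dirs(file.path(tempdir(), "gmted"))
```

This would reduce every download function to one directory argument while preserving the compressed/unzipped distinction internally.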

@Spatiotemporal-Exposures-and-Toxicology Yes, a custom linter will help a lot! I will work on adding one.

mitchellmanware commented 4 months ago

Right, thanks for the reminder. That would allow for succinct folder arguments across functions, which is good, but it seems low priority for now. I will revisit next week.

sigmafelix commented 4 months ago

#13 is merged into main. It reflects what we agreed on in our discussion. Per my comment in #7, sites is replaced with locs; path remains in process_* functions and is replaced with from in calc_* functions.

mitchellmanware commented 4 months ago

Great, thank you Insang. I am very close to being done with my function migration and will create a pull request once finished.

kyle-messier commented 3 weeks ago

@mitchellmanware closing - reopen if needed. 🤙