Closed — mitchellmanware closed this issue 3 weeks ago
`process_hms` > `calc_hms` vs. `calc_geos`). Should we use consistent terms and separate functions based on their position within the progression "data exists online > data is downloaded to machine > data is subsetted to variables/locations of interest > data is imported into R" (example only)?

Rename `calc_data` to `covariates_data` for the generation of covariates. A `calc_data` function can be created for calculating temporal mean/median/range, etc., but `covariate_data` is more specific to the extraction of data at point locations.

Thanks, Mitchell
Whether the additional data sources are included in the first or a later version, I think our immediate priority is cleaning up what we have. It will be easier to add additional download, process, and calculation functions once we have standard function names and more robust unit/integration tests.
The code naming consistency can be based on the output of each function. The `download...` functions return raw, unprocessed downloaded data from various sources. The `process...` functions clean these raw data (i.e., variable selection from netCDF, spatial subset) and return a single SpatRaster/SpatVector object. The user would have to utilize only two functions to have cleaned and interpretable geospatial data in the R environment. The `calc...` functions can be renamed to `covariate...` because the extraction of data at points or polygons is too limited in scope for the title "calc". `calc...` functions can be created to calculate spatio-temporal summary statistics (i.e., average temperature within counties/census tracts or a user-defined set of polygons).
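The output-based naming above can be sketched with stub functions (the names and bodies here are illustrative placeholders, not the package's actual API):

```r
# Stubs standing in for one source's pipeline: download returns raw
# file(s) on disk, process returns one cleaned object, covariate
# extracts values at point locations.
download_hms <- function(directory) {
  file.path(directory, "raw_file.nc") # path to raw downloaded data
}
process_hms <- function(path) {
  list(class = "SpatRaster", source = path) # one cleaned object
}
covariate_hms <- function(processed, locs) {
  data.frame(loc = locs, value = NA_real_) # values at locations
}

f <- download_hms(tempdir())
p <- process_hms(f)
covariate_hms(p, locs = c("site1", "site2"))
```

Two calls (`download_hms` then `process_hms`) give the user cleaned data in R; the `covariate_` step is only needed for extraction.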
- Refactor `calc_narr` and `calc_geos` to have separate `process...` and `covariate...` functions
- `process...` function parameters: `crop =`, `variables =`, `temporal =` (for GEOS-CF, as daily data may be more useful than hourly)
- `process...` and `covariate...` functions in the `download... > process... > covariates...` pipeline (`calc...`)
- `process...` parameters
- Unit tests
`calc_`, `process_`, or `covariate_` functions may need to include an extent/tile range for flexibility. The functions in this package, at least the `calc_modis` function and its dependencies, are generic enough to accept file paths and select one-day file paths to process.

We mainly use circular buffer polygons, which are generated inside each `calc_` function, to extract values from raster datasets. Generic polygon inputs need to be supported for general summary cases (jurisdictions, isochrones, etc.). One option is to keep circular buffers as the default.
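That default could look something like the following sketch (assumes `terra`; the helper name and arguments are hypothetical, not package code):

```r
library(terra)

# Hypothetical helper: circular buffers around point locations are the
# default, but user-supplied polygons (jurisdictions, isochrones,
# census tracts, ...) can be passed in their place.
extract_with_buffers <- function(rast, locs, radius = 1000, polys = NULL) {
  if (is.null(polys)) {
    polys <- terra::buffer(locs, width = radius)
  }
  terra::extract(rast, polys, fun = mean, na.rm = TRUE)
}

# Demo with a constant in-memory raster and a single point location
r <- rast(nrows = 10, ncols = 10, extent = ext(0, 10, 0, 10), crs = "EPSG:4326")
values(r) <- 1
pts <- vect(data.frame(x = 5, y = 5), geom = c("x", "y"), crs = "EPSG:4326")
extract_with_buffers(r, pts, radius = 50000)
```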
Discussion 2024/01/2018 @sigmafelix @Spatiotemporal-Exposures-and-Toxicology
Split `calc_*` functions into separate `process_` and `covariate_` functions:

- `process_` function w/o spatial subset
- `download_` > `process_` pipeline
- Unify arguments across the `download_` and `process_` functions so users can `do.call` both the `download_` and `process_` functions sequentially with the same parameters. Also helpful for integration testing.
- `calculate_` function based on uniform output from `process_`
- `buffer =` as a standard parameter for `calculate_` function(s)
- `cat(...)` statements (for users)

For Point 1, MODIS-related auxiliary functions are separated from the main functions and placed in R/calculate_covariates_support.R. The file and function names will be changed in the next PR. I guess `covariate_` is supposed to be `calculate_`?
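The shared-parameter `do.call` idea can be sketched with stubs (hypothetical signatures standing in for the real `download_`/`process_` functions):

```r
# Because both steps accept identical argument names, a single
# parameter list can drive them sequentially via do.call().
download_geos <- function(date_start, date_end, variable, directory_with_data) {
  sprintf("downloaded %s (%s to %s) into %s",
          variable, date_start, date_end, directory_with_data)
}
process_geos <- function(date_start, date_end, variable, directory_with_data) {
  sprintf("processed %s (%s to %s) from %s",
          variable, date_start, date_end, directory_with_data)
}

params <- list(
  date_start = "2018-01-01",
  date_end = "2018-01-02",
  variable = "O3",
  directory_with_data = "../testdata/geos/"
)
do.call(download_geos, params)
do.call(process_geos, params)
```

The same `params` list could then be reused in integration tests covering the full download-then-process sequence.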
For Point 4 (helper function), I think a spatial data class conversion function will be necessary. Functions written earlier (mostly by me) usually accept `sf` or `SpatVector` objects in the `sites` argument, whereas the later versions accept `stdt`.
For Point 5, I think unifying common argument names across the `process_` functions needs to be prioritized. Before then, we need to think about Point 6 and the format of covariate source datasets. What would be a proper form: an `sf`/`stars`/`Spat*` object or file path(s)? The intermediate product from `process_` could be vector or raster, so we might classify the processing functions into two groups and design uniform arguments in the `calculate_` functions accordingly.
@sigmafelix I agree with all your points.
For Points 5/6, my thought was that the `process_` functions would return a cleaned `Spat*` object into the user's R environment. By "cleaned" I mean the following: the "cleaning" does not have to include manipulation/calculation/alteration of the data values in any way (ignore the contradiction with `daily =` in `process_geos` 😅), which would make the functions 1. simpler and 2. reliant on fewer arguments. The essential arguments that come to mind are temporal range (start and end), variable, and directory with data. Other source-specific items (i.e., collection) can be verified by checking for the presence of the variable within the data's varnames rather than being a separate argument.
I think generating cleaned data in a two-step workflow, `download_data(...)` > `process_(...)`, is useful enough to the users that it saves lots of time, but also general enough that they/we can run our own `calc_`/`covar_` functions or proceed with other analyses in R.
Lots of rambling thoughts so I will refine and post again with an example.
`import_geos` and `import_narr` have been simplified to utilize only the `date_start`, `date_end`, `variable`, and `directory_with_data` parameters. The functions import and clean data, providing the user with a single SpatRaster object for their desired temporal range and variable. The data are not changed in any way; only the CRS, time, and variable names are cleaned for information retention (i.e., pressure levels in variable names).
> d <- "2018-01-01"
> dwd <- "../testdata/"
> g <- import_geos(
+ date_start = d,
+ date_end = d,
+ variable = "O3",
+ paste0(dwd, "geos/aqc_tavg_1hr_g1440x721_v1"))
Cleaning O3 data for 2018-01-01 00:30:00...
Cleaning O3 data for 2018-01-01 01:30:00...
Cleaning O3 data for 2018-01-01 02:30:00...
Cleaning O3 data for 2018-01-01 03:30:00...
Cleaning O3 data for 2018-01-01 04:30:00...
Cleaning O3 data for 2018-01-01 05:30:00...
Cleaning O3 data for 2018-01-01 06:30:00...
Cleaning O3 data for 2018-01-01 07:30:00...
Cleaning O3 data for 2018-01-01 08:30:00...
Cleaning O3 data for 2018-01-01 09:30:00...
Cleaning O3 data for 2018-01-01 10:30:00...
Cleaning O3 data for 2018-01-01 11:30:00...
Cleaning O3 data for 2018-01-01 12:30:00...
Cleaning O3 data for 2018-01-01 13:30:00...
Cleaning O3 data for 2018-01-01 14:30:00...
Cleaning O3 data for 2018-01-01 15:30:00...
Cleaning O3 data for 2018-01-01 16:30:00...
Cleaning O3 data for 2018-01-01 17:30:00...
Cleaning O3 data for 2018-01-01 18:30:00...
Cleaning O3 data for 2018-01-01 19:30:00...
Cleaning O3 data for 2018-01-01 20:30:00...
Cleaning O3 data for 2018-01-01 21:30:00...
Cleaning O3 data for 2018-01-01 22:30:00...
Cleaning O3 data for 2018-01-01 23:30:00...
Returning hourly O3 data from 2018-01-01 to 2018-01-01.
> g
class : SpatRaster
dimensions : 721, 1440, 24 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : -180.125, 179.875, -90.125, 90.125 (xmin, xmax, ymin, ymax)
coord. ref. :
sources : GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:O3
GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0130z.nc4:O3
GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0230z.nc4:O3
... and 21 more source(s)
varnames : O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air)
O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air)
O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air)
...
names : O3_le~03000, O3_le~13000, O3_le~23000, O3_le~33000, O3_le~43000, O3_le~53000, ...
unit : mol mol-1, mol mol-1, mol mol-1, mol mol-1, mol mol-1, mol mol-1, ...
time : 2018-01-01 00:30:00 to 2018-01-01 23:30:00 UTC
> names(g)[1:10]
[1] "O3_lev=72_20180101_003000" "O3_lev=72_20180101_013000"
[3] "O3_lev=72_20180101_023000" "O3_lev=72_20180101_033000"
[5] "O3_lev=72_20180101_043000" "O3_lev=72_20180101_053000"
[7] "O3_lev=72_20180101_063000" "O3_lev=72_20180101_073000"
[9] "O3_lev=72_20180101_083000" "O3_lev=72_20180101_093000"
> time(g)[1:10]
[1] "2018-01-01 00:30:00 UTC" "2018-01-01 01:30:00 UTC"
[3] "2018-01-01 02:30:00 UTC" "2018-01-01 03:30:00 UTC"
[5] "2018-01-01 04:30:00 UTC" "2018-01-01 05:30:00 UTC"
[7] "2018-01-01 06:30:00 UTC" "2018-01-01 07:30:00 UTC"
[9] "2018-01-01 08:30:00 UTC" "2018-01-01 09:30:00 UTC"
> n <- import_narr(
+ date_start = d,
+ date_end = d,
+ variable = "omega",
+ paste0(dwd, "narr/omega/"))
Cleaning data for year 2018...
Returning daily omega data from 20180101 to 20180101.
> n
class : SpatRaster
dimensions : 277, 349, 29 (nrow, ncol, nlyr)
resolution : 32462.99, 32463 (x, y)
extent : -16231.49, 11313351, -16231.5, 8976020 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=lcc +lat_0=50 +lon_0=-107 +lat_1=50 +lat_2=50 +x_0=5632642.22547 +y_0=4612545.65137 +datum=WGS84 +units=m +no_defs
source : omega.201801.nc:omega
varname : omega (Daily Omega on Pressure Levels)
names : omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, ...
unit : Pascal/s, Pascal/s, Pascal/s, Pascal/s, Pascal/s, Pascal/s, ...
time : 2018-01-01 UTC
> names(n)[1:10]
[1] "omega_level=1000_20180101" "omega_level=975_20180101"
[3] "omega_level=950_20180101" "omega_level=925_20180101"
[5] "omega_level=900_20180101" "omega_level=875_20180101"
[7] "omega_level=850_20180101" "omega_level=825_20180101"
[9] "omega_level=800_20180101" "omega_level=775_20180101"
> time(n)[1:10]
[1] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
[5] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
[9] "2018-01-01 UTC" "2018-01-01 UTC"
>
@sigmafelix Is this more along the lines of "common argument names"?
`import_*` functions: `import_geos` is the function as it currently exists on the mm-process-functions branch; spatial subsetting occurs external to the function after the data has been cleaned. `import_geos_crop_final` crops the final single SpatRaster after all data has been cleaned. `import_geos_crop_layer` crops each layer of the data individually after cleaning but before merging into the final SpatRaster.
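The two internal strategies can be illustrated with in-memory rasters (assumes `terra`; these stand in for the GEOS-CF netCDF layers the real functions read):

```r
library(terra)

# A stand-in for one cleaned hourly layer
make_layer <- function() {
  r <- rast(nrows = 72, ncols = 144, extent = ext(-180, 180, -90, 90))
  values(r) <- runif(ncell(r))
  r
}
aoi <- ext(-85, -75, 33, 37) # illustrative area of interest

# crop_final: clean and merge all layers, then crop the result once
final <- crop(c(make_layer(), make_layer()), aoi)

# crop_layer: crop each layer right after cleaning, then merge
layerwise <- c(crop(make_layer(), aoi), crop(make_layer(), aoi))
```

Both paths yield a SpatRaster restricted to the area of interest; the benchmarks below compare their runtimes on real data.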
Test with collection aqc_tavg_1hr_g1440x721_v1 (single level):
> microbenchmark(
crop_external = import_geos(
variable = "O3",
directory_with_data = d
) %>>% terra::crop(
project(
ct,
"EPSG:4326"
)
),
crop_final = import_geos_crop_final(
variable = "O3",
directory_with_data = d,
crop = ct
),
crop_layer = import_geos_crop_layer(
variable = "O3",
directory_with_data = d,
crop = ct
),
times = 5
)
Unit: seconds
expr min lq mean median uq max neval cld
crop_external 2.469672 2.484679 2.492802 2.485250 2.497332 2.527078 5 a
crop_final 2.472164 2.472217 2.488969 2.473530 2.498845 2.528091 5 a
crop_layer 2.636986 2.655052 2.912806 2.661639 2.692731 3.917623 5 a
Test with collection chm_inst_1hr_g1440x721_p23 (pressure levels):
> microbenchmark(
+ crop_external = import_geos(
+ variable = "O3",
+ directory_with_data = d
+ ) %>>% terra::crop(
+ project(
+ ct,
+ "EPSG:4326"
+ )
+ ),
+ crop_final = import_geos_crop_final(
+ variable = "O3",
+ directory_with_data = d,
+ crop = ct
+ ),
+ crop_layer = import_geos_crop_layer(
+ variable = "O3",
+ directory_with_data = d,
+ crop = ct
+ ),
+ times = 5
+ )
Unit: seconds
expr min lq mean median uq max neval cld
crop_external 9.940528 10.03314 10.17560 10.19378 10.19413 10.51641 5 a
crop_final 10.039182 10.10433 10.82902 10.12391 10.52642 13.35124 5 a
crop_layer 10.357362 10.50860 10.56581 10.55911 10.62703 10.77697 5 a
At least for GEOS-CF data, performance between these functions was similar, so we can discuss including `crop =` or `subset =` as a standard parameter in the `import_*` functions.
Local: When run locally, spatial subsetting after layer-specific cleaning is faster by 10 to 15 seconds.
> microbenchmark(
+ crop_external = import_geos(
+ variable = "O3",
+ directory_with_data = d
+ ) %>>% terra::crop(
+ project(
+ ct,
+ "EPSG:4326"
+ )
+ ),
+ crop_final = import_geos_crop_final(
+ variable = "O3",
+ directory_with_data = d,
+ crop = ct
+ ),
+ crop_layer = import_geos_crop_layer(
+ variable = "O3",
+ directory_with_data = d,
+ crop = ct
+ ),
+ times = 2
+ )
Unit: seconds
expr min lq mean median uq max neval
crop_external 464.0877 464.0877 464.4559 464.4559 464.8242 464.8242 2
crop_final 458.4853 458.4853 460.2615 460.2615 462.0377 462.0377 2
crop_layer 446.7328 446.7328 449.1123 449.1123 451.4919 451.4919 2
>
The `process_*` functions will do pretty much the same things as `modis_*`. I will work on separating the processing parts from `calc_*` first, then on unifying and minimizing the arguments of the `process_*` functions.
- `process_ecoregion` from `calc_ecoregion`
- `process_koppen_geiger` from `calc_koppen_geiger`
- `process_nei` from `calc_nei`
- `process_nlcd_ratio` from `calc_nlcd_ratio`
- `process_tri` from `calc_tri`
- `calc_modis`: `modis_*` to `process_modis_*`
cf. Argument lists by function in `R/calculate_covariates.R` and `R/calculate_covariates_support.R`:
# calc_* function arguments
fun path sites id_col product name_covariates radius subdataset fun_summary nthreads package_list_add
1 calc_ecoregion TRUE TRUE TRUE NA NA NA NA NA NA NA
2 calc_koppen_geiger TRUE TRUE TRUE NA NA NA NA NA NA NA
3 calc_modis TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
4 calc_nei TRUE TRUE TRUE NA NA NA NA NA NA NA
5 calc_nlcd_ratio TRUE TRUE NA NA NA TRUE NA NA NA NA
6 calc_temporal_dummies NA TRUE TRUE NA NA NA NA NA NA NA
7 calc_tri TRUE TRUE TRUE NA NA TRUE NA NA NA NA
export_list_add year county_shp sites_epsg domain_year
1 NA NA NA NA NA
2 NA NA NA NA NA
3 TRUE NA NA NA NA
4 NA TRUE TRUE TRUE NA
5 NA TRUE NA NA NA
6 NA NA NA NA TRUE
7 NA NA NA TRUE TRUE
# modis_* function arguments
fun path product nsds fun_agg paths date_in regex_sds foo get_var resolution custom_sel subdataset crs_ref
1 modis_aggregate_sds TRUE TRUE TRUE TRUE NA NA NA NA NA NA NA NA NA
2 modis_get_vrt NA TRUE NA NA TRUE TRUE TRUE TRUE NA NA NA NA NA
3 modis_mosaic_mod06 NA NA NA NA TRUE TRUE NA NA TRUE TRUE NA NA NA
4 modis_prefilter_sds NA TRUE NA NA NA NA NA NA NA NA TRUE NA NA
5 modis_preprocess_vnp46 NA NA NA NA TRUE TRUE NA NA NA NA NA TRUE TRUE
6 modis_warp_stars TRUE NA NA NA NA NA NA NA NA NA NA NA NA
7 modis_worker NA TRUE NA NA NA NA NA NA NA NA NA NA NA
cellsize threshold crs_out raster date sites_in name_extracted fun_summary_raster id_col radius
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA
6 TRUE TRUE TRUE NA NA NA NA NA NA NA
7 NA NA NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE
@mitchellmanware As per our plan to split `calc_*` into `process_*` and `calculate_*` and to generalize functions, we may need to rename common arguments in these functions. At least in the functions I've been working on, `path` and `sites` are usually used. I think generic names such as `x` and `y`, or `loc`/`locs` and `from`, which nuance "calculate covariates at `x`/`loc`/`locs` from `from`", would work better. `sites` sounds a little specific to AQS cases. I'd be happy to discuss more generic argument names.
@sigmafelix It's always nice to have as much internal consistency as possible - this could potentially even be made into a special use-case advanced linter (e.g. `argument_name_linter`).
I agree these general names are more broadly applicable. I think there could be more variability in the `process_` function arguments, as the raw data can take on many forms, but we can implement vectorized arguments for collection/variable or statistic/resolution to retain consistency.

i.e., for the GMTED processing function:

gmted <-
  import_gmted(
    variable = c("Breakline Emphasis", "7.5 arc seconds"),
    directory_with_data = "../testdata/gmted/gmted"
  )
The `variable` argument takes both the statistic and resolution but conforms to the simplified argument names.
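Internally, a small helper could unpack the vectorized `variable` argument into its two components (the helper below is hypothetical, not package code):

```r
# Unpack the vectorized `variable` argument into the statistic and
# resolution it carries, erroring if either component is missing.
parse_gmted_variable <- function(variable) {
  stopifnot(length(variable) == 2)
  list(statistic = variable[1], resolution = variable[2])
}

parse_gmted_variable(c("Breakline Emphasis", "7.5 arc seconds"))
```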
We can also apply the `path` argument to the download functions. I do not recall why I created separate `directory_to_download` and `directory_to_save` folders.
@mitchellmanware The separation was for distinguishing a directory of compressed raw data from a directory of the unzipped data (what we actually use to process and calculate). `directory_to_download` was for the former, whereas `directory_to_save` was for the latter. One way to simplify the directory setting could be to force making two directories under `directory_to_save`, say `zips` and `raw`, inside the download functions. Exceptions could be handled with the extension in the download links.
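That forced-subdirectory scheme might look like the following inside a download function (a sketch only; the helper name is made up):

```r
# Create fixed `zips` and `raw` subdirectories under directory_to_save,
# so compressed downloads and unzipped data each have a known home.
setup_save_dirs <- function(directory_to_save) {
  dirs <- file.path(directory_to_save, c("zips", "raw"))
  for (d in dirs) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  names(dirs) <- c("zips", "raw")
  dirs
}

dirs <- setup_save_dirs(file.path(tempdir(), "hms"))
dirs["zips"] # destination for compressed downloads
dirs["raw"]  # destination for unzipped data
```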
@Spatiotemporal-Exposures-and-Toxicology Yes, a custom linter will help a lot! I will work on adding one.
Right, thanks for the reminder. That would allow for succinct folder arguments across functions, which is good, but it seems low priority for now. I will revisit next week.
`main`. I reflected what we've agreed on in our discussion. Reflecting my comment in #7, `sites` is replaced with `locs`; `path` remains in `process_*` functions and is replaced with `from` in `calc_*` functions.

Great, thank you Insang. I am very close to being done with my function migration and will create a pull request once finished.
@mitchellmanware closing - reopen if needed. 🤙
Updates, additions, and next steps for the suite of functions designed to download, process, and calculate covariates from geospatial environmental data sources. @sigmafelix @Spatiotemporal-Exposures-and-Toxicology