NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

test set data - standardize small location for all tests #233

Open kyle-messier opened 6 months ago

kyle-messier commented 6 months ago

We can utilize the same area for test sets to be carried throughout the pipeline.

kyle-messier commented 6 months ago
sigmafelix commented 6 months ago

I think the easiest and most relatable spatial range is Durham County, NC, which can be queried from the file bundled in the sf package via system.file("shape/nc.shp", package = "sf") (reprojection required). The county has a total area of 772 sq km (which gives about 772 points at our 1 km spatial resolution) and includes a substantial part of the RTP area (following what Eva proposed).
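
A minimal sketch of that query with terra (the target CRS below is purely illustrative):

library(terra)

# load the North Carolina counties bundled with sf and keep Durham County
nc <- vect(system.file("shape/nc.shp", package = "sf"))
durham <- nc[nc$NAME == "Durham", ]

# nc.shp ships in geographic coordinates, so reproject to a projected CRS
durham_proj <- project(durham, "EPSG:5070")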

eva0marques commented 6 months ago

Thank you Insang, I will use this spatial range to create the testdata.

mitchellmanware commented 6 months ago

I will utilize the Durham County, NC area for test data associated with the NCEP NARR and GEOS-CF data sources.

A full data set for the "pressure levels" NCEP NARR data may have to be included due to the layer naming structure used in the netCDF file. The full netCDF does not indicate each layer's "pressure level" anywhere other than in the layer names, information that is lost when trying to re-write the subset as .nc. I will continue to explore this further.

mitchellmanware commented 6 months ago

@eva0marques

Do you plan to/have you already created a subset of monitor sites within Durham County, NC for use in testing? If not, I will create one.

eva0marques commented 6 months ago

Sorry for the delayed answer, I completely stopped using my computer during the holidays. I will start working on this task in the coming days. The idea is to create a vignette to extract Durham data for each covariate. For the future pipeline steps, I will take your suggestion into consideration, @mitchellmanware, and also extract monitors within Durham County (or create fake ones if there are no existing monitors there).
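
If fake monitors end up being needed, a minimal sketch of one way to generate them with terra (purely illustrative; the site_id column name is an assumption):

# sample random fake monitor locations inside Durham County
nc <- terra::vect(system.file("shape/nc.shp", package = "sf"))
durham <- nc[nc$NAME == "Durham", ]
fake_sites <- terra::spatSample(durham, size = 5, method = "random")
fake_sites$site_id <- paste0("FAKE-", seq_len(nrow(fake_sites)))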

eva0marques commented 6 months ago

Test data will be created at each step of the pipeline to test the following step.

Test data content:

eva0marques commented 6 months ago

Note: the resolution of some covariates is very coarse. Selecting only Durham County means that we only test on 1 pixel for some of the covariates (e.g., NARR).

mitchellmanware commented 6 months ago

@eva0marques

Do you encounter any errors or warnings when reading the test data sets into R? When I was experimenting with them, the 1 x 1 dimensions resulted in extent-related errors.

kyle-messier commented 6 months ago

@mitchellmanware This could result in identifiability issues (e.g., if you regress Y on X and X has only 1 value).

mitchellmanware commented 6 months ago

@Spatiotemporal-Exposures-and-Toxicology @eva0marques

Yes, I thought the same thing. Maybe we should expand the area to Durham County and a few of its neighbors?

eva0marques commented 6 months ago

@mitchellmanware I don't understand the problem, can you give me more details? I can open the NARR data properly, and yes, when I extract Durham County I only get one pixel. I will check whether the whole county is covered by that pixel.

[screenshot: NARR data extracted over Durham County]
mitchellmanware commented 6 months ago

For me, running the following code caused an extent error, which may have been due to the dimensions being 1x1. The same error did not appear when I ran the code this time.

> # north carolina
> nc <- tigris::counties(state = "North Carolina")
Retrieving data for the year 2021
Using FIPS code '37' for state 'North Carolina'
> # durham
> du <- nc[nc$NAME == "Durham",]
> # durham as terra vect
> du_v <- vect(du)
> # narr air.sfc
> a <- rast("../../data/narr/air.sfc/air.sfc.2018.nc")
> # crop
> a_crop <- crop(
+   a,
+   project(du_v, crs(a))
+ )
> # plot
> terra::plot(a_crop$air_1)

[plot of a_crop$air_1]


> # save
> writeCDF(
+   a_crop,
+   filename = "./air.sfc.2018.CROP.nc",
+   varname = "air"
+ )
> # reread
> a_crop_reread <- rast("./air.sfc.2018.CROP.nc")
> a_crop_reread
class       : SpatRaster 
dimensions  : 1, 1, 365  (nrow, ncol, nlyr)
resolution  : 32462.99, 32463  (x, y)
extent      : 8164442, 8196905, 3522236, 3554698  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=lcc +lat_0=50 +lon_0=-107 +lat_1=50 +lat_2=50 +x_0=5632642.22547 +y_0=4612545.65137 +datum=WGS84 +units=m +no_defs 
source      : air.sfc.2018.CROP.nc 
varname     : air 
names       : air_1, air_2, air_3, air_4, air_5, air_6, ... 
time        : 2018-01-01 to 2018-12-31 UTC 
> 
eva0marques commented 6 months ago

OK, I plotted the Durham shapefile on it and it is true that we are missing areas. What I need to do is just correct the extraction to get all the necessary tiles. Is there any other problem I'm missing?

[screenshot: Durham County boundary over the cropped NARR raster, showing uncovered areas]
eva0marques commented 6 months ago
[screenshot: crop result with snap = "out"]

I just need to add the argument snap = "out" to the crop function.
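
For reference, a minimal sketch of that call (the NARR file path is illustrative):

narr <- terra::rast("data/narr/air.sfc/air.sfc.2018.nc")
durham <- terra::vect(system.file("shape/nc.shp", package = "sf"))
durham <- durham[durham$NAME == "Durham", ]
# snap = "out" expands the crop window so it fully covers the county polygon
narr_crop <- terra::crop(narr, terra::project(durham, terra::crs(narr)), snap = "out")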

mitchellmanware commented 6 months ago

@eva0marques

Per discussion with @Spatiotemporal-Exposures-and-Toxicology, we could also use Durham and Wake counties as the test area. This ensures that we have dimensions greater than 1x1 and makes it possible to include multiple monitor locations in the testing.

[map of the proposed Durham and Wake County test area]

mitchellmanware commented 6 months ago

@eva0marques

Also, the omega NARR variable was causing some problems because the "pressure level" information is stored only in the layer names.

> o <- terra::rast(
+   paste0(dir, "data/narr/omega/omega.201801.nc")
+ )
> names(o)[1:20]
 [1] "omega_level=1000_1" "omega_level=975_1"  "omega_level=950_1" 
 [4] "omega_level=925_1"  "omega_level=900_1"  "omega_level=875_1" 
 [7] "omega_level=850_1"  "omega_level=825_1"  "omega_level=800_1" 
[10] "omega_level=775_1"  "omega_level=750_1"  "omega_level=725_1" 
[13] "omega_level=700_1"  "omega_level=650_1"  "omega_level=600_1" 
[16] "omega_level=550_1"  "omega_level=500_1"  "omega_level=450_1" 
[19] "omega_level=400_1"  "omega_level=350_1" 

When the data was subset to NC and written as a netCDF, the pressure level information was lost. One way to overcome this is to select only the level=1000 layers and write the variable with varname = "omega_level=1000". I will look for other ways to retain the pressure level information and use it to write the variable names, but this should work in the interim for our test data set. Suggestion below.

> o <- terra::rast(
+   paste0(dir, "data/narr/omega/omega.201801.nc")
+ )
> o1000 <- terra::subset(
+   o,
+   subset = grep(
+     paste0(
+       "level=1000"
+     ),
+     names(o)
+   )
+ )
> nc <- tigris::counties(state = "NC")
Retrieving data for the year 2021
Using FIPS code '37' for state 'NC'
> nc_v <- terra::vect(nc)
> nc_v <- nc_v[nc_v$NAME %in% c("Durham", "Wake"),]
> nc_p <- terra::project(nc_v, terra::crs(o))
> o1000_crop <- terra::crop(
+   o1000,
+   nc_p,
+   snap = "out"
+ )
> terra::writeCDF(
+   o1000_crop,
+   paste0(dir, "/tests/testdata/omega/omega.201801.CROP.nc"),
+   varname = "omega_level=1000"
+ )
> o_reread <- terra::rast(
+   paste0(dir, "/tests/testdata/omega/omega.201801.CROP.nc")
+ )
> names(o_reread)
 [1] "omega_level=1000_1"  "omega_level=1000_2"  "omega_level=1000_3" 
 [4] "omega_level=1000_4"  "omega_level=1000_5"  "omega_level=1000_6" 
 [7] "omega_level=1000_7"  "omega_level=1000_8"  "omega_level=1000_9" 
[10] "omega_level=1000_10" "omega_level=1000_11" "omega_level=1000_12"
[13] "omega_level=1000_13" "omega_level=1000_14" "omega_level=1000_15"
[16] "omega_level=1000_16" "omega_level=1000_17" "omega_level=1000_18"
[19] "omega_level=1000_19" "omega_level=1000_20" "omega_level=1000_21"
[22] "omega_level=1000_22" "omega_level=1000_23" "omega_level=1000_24"
[25] "omega_level=1000_25" "omega_level=1000_26" "omega_level=1000_27"
[28] "omega_level=1000_28" "omega_level=1000_29" "omega_level=1000_30"
[31] "omega_level=1000_31"
eva0marques commented 6 months ago

@mitchellmanware

eva0marques commented 6 months ago

@mitchellmanware
In the end I just extract the dates I need like this, using the time() function in terra.

o <- terra::rast(paste0(dir, "data/narr/omega/omega.201801.nc"))  # or terra::rast(file_path)
o[[terra::time(o) == as.Date("2022-06-01")]]
kyle-messier commented 6 months ago
mitchellmanware commented 6 months ago

This code successfully created small spatial test data sets for the GEOS-CF chm and aqc collections. They include only the tested variables (ACET and CO, respectively) to keep the testing lightweight, and they worked successfully in the tests.

## in folder with one day's worth of GEOS aqc files
files <- list.files()
dw <- terra::vect("../durham_wake.shp")
dw_p <- terra::project(dw, "EPSG:4326")
for (f in seq_along(files)) {
  data <- terra::rast(files[f], subds = "CO")
  terra::crs(data) <- "EPSG:4326"
  data_p <- terra::project(data, "EPSG:4326")
  data_crop <- terra::crop(data_p,
                           dw_p,
                           snap = "out")
  terra::writeCDF(data_crop,
                  paste0("../aqc/", files[f]),
                  varname = "CO")
}

## in folder with one day's worth of GEOS chm files
files <- list.files()
dw <- terra::vect("../durham_wake.shp")
dw_p <- terra::project(dw, "EPSG:4326")
for (f in seq_along(files)) {
  data <- terra::rast(files[f], subds = "ACET")
  terra::crs(data) <- "EPSG:4326"
  data_p <- terra::project(data, "EPSG:4326")
  data_crop <- terra::crop(data_p,
                           dw_p,
                           snap = "out")
  terra::writeCDF(data_crop,
                  paste0("../chm/", files[f]),
                  varname = "ACET")
}
eva0marques commented 6 months ago

OK, thank you for your help! So you do not use the other netCDF metadata (units, etc.). Also, do we accept that it is not important to have all the covariates in the testdata folder? It may cause issues in future unit testing of the pipeline, but maybe we can start like this and complete the folder as needed.

sigmafelix commented 6 months ago

Multiple layers can be written with one command by setting a few arguments:

library(terra)
#> terra 1.7.67

moddir <-
    file.path("/ddn/gs1/group/set",
              "Projects",
              "NRT-AP-Model",
              "input",
              "modis",
              "raw",
              "61",
              "MOD09GA")
modfiles <- list.files(moddir, "*.hdf", full.names = TRUE, recursive = TRUE)

# when split=TRUE
writeCDF(
    mod11_1,
    "~/writetest.nc",
    varname = names(mod11_1),
    overwrite = TRUE,
    unit = units(mod11_1),
    split = TRUE)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'writeCDF': object 'mod11_1' not found
modtest <- rast("~/writetest.nc")
modtest
#> class       : SpatRaster 
#> dimensions  : 2400, 2400, 12  (nrow, ncol, nlyr)
#> resolution  : 463.3127, 463.3127  (x, y)
#> extent      : -8895604, -7783654, 3335852, 4447802  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> sources     : writetest.nc:num_observations_500m  
#>               writetest.nc:sur_refl_b01_1  
#>               writetest.nc:sur_refl_b02_1  
#>               ... and 9 more source(s)
#> varnames    : num_observations_500m 
#>               sur_refl_b01_1 
#>               sur_refl_b02_1 
#>               ...
#> names       : num_o~_500m, sur_r~b01_1, sur_r~b02_1, sur_r~b03_1, sur_r~b04_1, sur_r~b05_1, ...

modtestsub <- rast("~/writetest.nc", subds = 2)
modtestsub
#> class       : SpatRaster 
#> dimensions  : 2400, 2400, 1  (nrow, ncol, nlyr)
#> resolution  : 463.3127, 463.3127  (x, y)
#> extent      : -8895604, -7783654, 3335852, 4447802  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> source      : writetest.nc:sur_refl_b01_1 
#> varname     : sur_refl_b01_1 
#> name        : sur_refl_b01_1

# print by subdatasets
describe("/ddn/gs1/home/songi2/writetest.nc", sds = TRUE)
#>    id                                                             name
#> 1   1 NETCDF:"/ddn/gs1/home/songi2/writetest.nc":num_observations_500m
#> 2   2        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b01_1
#> 3   3        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b02_1
#> 4   4        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b03_1
#> 5   5        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b04_1
#> 6   6        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b05_1
#> 7   7        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b06_1
#> 8   8        NETCDF:"/ddn/gs1/home/songi2/writetest.nc":sur_refl_b07_1
#> 9   9             NETCDF:"/ddn/gs1/home/songi2/writetest.nc":QC_500m_1
#> 10 10         NETCDF:"/ddn/gs1/home/songi2/writetest.nc":obscov_500m_1
#> 11 11            NETCDF:"/ddn/gs1/home/songi2/writetest.nc":iobs_res_1
#> 12 12              NETCDF:"/ddn/gs1/home/songi2/writetest.nc":q_scan_1
#>                      var
#> 1  num_observations_500m
#> 2         sur_refl_b01_1
#> 3         sur_refl_b02_1
#> 4         sur_refl_b03_1
#> 5         sur_refl_b04_1
#> 6         sur_refl_b05_1
#> 7         sur_refl_b06_1
#> 8         sur_refl_b07_1
#> 9              QC_500m_1
#> 10         obscov_500m_1
#> 11            iobs_res_1
#> 12              q_scan_1
#>                                                         desc nrow ncol nlyr
#> 1  [2400x2400] num_observations_500m (32-bit floating-point) 2400 2400    1
#> 2         [2400x2400] sur_refl_b01_1 (32-bit floating-point) 2400 2400    1
#> 3         [2400x2400] sur_refl_b02_1 (32-bit floating-point) 2400 2400    1
#> 4         [2400x2400] sur_refl_b03_1 (32-bit floating-point) 2400 2400    1
#> 5         [2400x2400] sur_refl_b04_1 (32-bit floating-point) 2400 2400    1
#> 6         [2400x2400] sur_refl_b05_1 (32-bit floating-point) 2400 2400    1
#> 7         [2400x2400] sur_refl_b06_1 (32-bit floating-point) 2400 2400    1
#> 8         [2400x2400] sur_refl_b07_1 (32-bit floating-point) 2400 2400    1
#> 9              [2400x2400] QC_500m_1 (32-bit floating-point) 2400 2400    1
#> 10         [2400x2400] obscov_500m_1 (32-bit floating-point) 2400 2400    1
#> 11            [2400x2400] iobs_res_1 (32-bit floating-point) 2400 2400    1
#> 12              [2400x2400] q_scan_1 (32-bit floating-point) 2400 2400    1

# when split=FALSE
writeCDF(
    mod11_1,
    "~/writetest_nosplit.nc",
    varname = names(mod11_1),
    overwrite = TRUE,
    unit = units(mod11_1),
    split = FALSE)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'writeCDF': object 'mod11_1' not found
modtest_split <- rast("~/writetest_nosplit.nc")

# does the exported file detect subdataset?
modtest_split2 <- rast("~/writetest_nosplit.nc", subds = 1)
modtest_split2
#> class       : SpatRaster 
#> dimensions  : 2400, 2400, 12  (nrow, ncol, nlyr)
#> resolution  : 463.3127, 463.3127  (x, y)
#> extent      : -8895604, -7783654, 3335852, 4447802  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> source      : writetest_nosplit.nc 
#> varname     : num_observations_500m 
#> names       : num_o~00m_1, num_o~00m_2, num_o~00m_3, num_o~00m_4, num_o~00m_5, num_o~00m_6, ...
modtest_split2 <- rast("~/writetest_nosplit.nc", subds = 13)
modtest_split2
#> class       : SpatRaster 
#> dimensions  : 2400, 2400, 12  (nrow, ncol, nlyr)
#> resolution  : 463.3127, 463.3127  (x, y)
#> extent      : -8895604, -7783654, 3335852, 4447802  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> source      : writetest_nosplit.nc 
#> varname     : num_observations_500m 
#> names       : num_o~00m_1, num_o~00m_2, num_o~00m_3, num_o~00m_4, num_o~00m_5, num_o~00m_6, ...
# print description by subdatasets
describe("/ddn/gs1/home/songi2/writetest_nosplit.nc", sds = TRUE)
#> Error: [gdal (sds)] no subdatasets

## GEOS example
geosdir <-
    file.path("/ddn/gs1/group/set",
              "Projects",
              "NRT-AP-Model",
              "input",
              "geos")
geosfiles <-
    list.files(geosdir,
               "*.nc4",
               full.names = TRUE,
               recursive = TRUE)
# 2018001
geosf <- geosfiles[seq(1, 24)]
# list of SpatRasters
geos2018001 <- lapply(geosf, rast)
# SpatRasterDataset
geos_c <- do.call(sds, geos2018001)
# Aggregate by day
geos_capp <- app(geos_c, index = "days", fun = "mean")
writeCDF(
    geos_capp,
    "~/writegeostest.nc",
    varname = names(geos_capp),
    overwrite = TRUE,
    unit = units(geos2018001[[1]]),
    split = TRUE)
geostest <- rast("~/writegeostest.nc")
geostest
#> class       : SpatRaster 
#> dimensions  : 721, 1440, 5  (nrow, ncol, nlyr)
#> resolution  : 0.25, 0.25  (x, y)
#> extent      : -180.125, 179.875, -90.125, 90.125  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> sources     : writegeostest.nc:CO_lev=72  
#>               writegeostest.nc:NO2_lev=72  
#>               writegeostest.nc:O3_lev=72  
#>               ... and 2 more source(s)
#> varnames    : CO_lev=72 
#>               NO2_lev=72 
#>               O3_lev=72 
#>               ...
#> names       : CO_lev=72, NO2_lev=72, O3_lev=72, PM25_R~lev=72, SO2_lev=72 
#> unit        : mol mol-1,  mol mol-1, mol mol-1,        ug m-3,  mol mol-1 
#> time        : 2018-01-01 00:30:00 UTC

Created on 2024-01-04 with reprex v2.0.2

While it is hard to say that the exported file is exactly the same as the raw input(s), we might make these files fairly equivalent by adding detailed arguments to writeCDF or by using ncdf4::ncvar_def.

mitchellmanware commented 6 months ago

@sigmafelix @eva0marques

It is a good point to include all covariates for future testing in the pipeline. The issue with the test data set created by Insang's code is that it will not work with the calc_geos function, which was written to import raw GEOS-CF data based on the downloaded file name and calculates a daily mean internally. Insang's recommendation of split = TRUE worked well to retain variable names.

> # working dir = "/ddn/gs1/home/manwareme/NRTAPModel_Covariates/tests/testdata/chm_full/"
> files <- list.files(pattern = ".nc4")
> dw <- terra::vect("../durham_wake.shp")
> dw_p <- terra::project(dw, "EPSG:4326")
> for (f in seq_along(files)) {
+   data <- terra::rast(files[f])
+   terra::crs(data) <- "EPSG:4326"
+   data_p <- terra::project(data, "EPSG:4326")
+   data_crop <- terra::crop(data_p,
+                            dw_p,
+                            snap = "out")
+   writeCDF(data_crop,
+            # written with full name to different folder
+            paste0("../chm/", files[f]),
+            varname = names(data_crop),
+            split = TRUE,
+            # retain units
+            unit = units(data))
+ }

The written netCDF retains the original file name, variable names, and units, and is cropped to the desired extent.

> geos_crop_read <- terra::rast("../chm/GEOS-CF.v01.rpl.chm_tavg_1hr_g1440x721_v1.20180101_0030z.nc4")
> geos_crop_read
class       : SpatRaster 
dimensions  : 4, 4, 52  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -79.125, -78.125, 35.375, 36.375  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
sources     : GEOS-CF.v01.rpl.chm_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:ACET_lev=72  
              GEOS-CF.v01.rpl.chm_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:ALD2_lev=72  
              GEOS-CF.v01.rpl.chm_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:ALK4_lev=72  
              ... and 49 more source(s)
varnames    : ACET_lev=72 
              ALD2_lev=72 
              ALK4_lev=72 
              ...
names       : ACET_lev=72, ALD2_lev=72, ALK4_lev=72, BCPI_lev=72, BCPO_lev=72, BENZ_lev=72, ... 
unit        :   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1, ... 
> 

The only issue with this is the amount of data required, as the GEOS-CF chm and aqc collections will each require 24 files (size = 0.058544 MB each). A workaround could be to zip these files and add an unzip() command in the unit and integration tests, because these are the only tests that would require the raw data.
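
A sketch of that zip workaround with base R (paths are illustrative):

# compress the hourly GEOS-CF test files once
geos_files <- list.files("tests/testdata/chm", full.names = TRUE)
utils::zip(zipfile = "tests/testdata/chm.zip", files = geos_files)

# then decompress at the top of the unit/integration tests that need them
utils::unzip("tests/testdata/chm.zip", exdir = tempdir())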

mitchellmanware commented 6 months ago

Minor tweak to the calc_geos function to select the variable after import with the grep() function. This allows subsetting with both raw data variable names (i.e., "ACET") and re-written test variable names (i.e., "ACET_lev=72"). The calc_geos unit tests pass with this change; a sketch of the selection logic follows the test output below.

> 
> 
> testthat::test_that("calc_geos calculates chm covariates.", {
+   # function parameters
+   date <- "2018-01-01"
+   collection <- "chm_tavg_1hr_g1440x721_v1"
+   variable <- "ACET"
+   sites <- readRDS("../testdata/sites_test.rds")
+   identifier <- "site_id"
+   fun <- "mean"
+   directory_with_data <- "../testdata/chm/"
+   # run covariate calculation
+   covariate_geos <- calc_geos(date_start = date,
+                               date_end = date,
+                               collection = collection,
+                               variable = variable,
+                               sites = sites,
+                               identifier = identifier,
+                               fun = fun,
+                               directory_with_data = directory_with_data)
+   # test that function returns data frame
+   expect_equal(class(covariate_geos), "data.frame")
+   # test that data frame has three columns
+   expect_equal(ncol(covariate_geos), 3)
+   # test that no missing values
+   expect_false(any(sapply(covariate_geos, is.na)))
+   # test that all days included
+   expect_equal(length(unique(covariate_geos$date)), 1)
+ })
Calculated ACET for date 20180101
Test passed 😀
> 
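For reference, a minimal sketch of the grep()-based selection described above (illustrative only, not the actual calc_geos code; the file name is one of the test files created earlier):

geos_file <- "GEOS-CF.v01.rpl.chm_tavg_1hr_g1440x721_v1.20180101_0030z.nc4"
data <- terra::rast(geos_file)
variable <- "ACET"
# matches both the raw name "ACET" and the re-written name "ACET_lev=72"
data_sub <- terra::subset(data, grep(variable, names(data)))
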
sigmafelix commented 6 months ago

@mitchellmanware Thank you for the changes in calc_geos.

@Spatiotemporal-Exposures-and-Toxicology @eva0marques I am suddenly confused about the temporal range of subsetting: do we subset all data for 2018-2022 or only a limited period? Since the subsets will be used for tests throughout the entire pipeline, I suppose the former would be the right period.

kyle-messier commented 6 months ago

@sigmafelix The pipeline is currently for 2018-2022 inclusive; tests are for a very limited (1-day) period. Does that answer the question?

sigmafelix commented 6 months ago

@Spatiotemporal-Exposures-and-Toxicology Thank you for the clarification. I was briefly confused between "testing the pipeline" and "testing functions." Now I realize that this issue is more about the latter.

kyle-messier commented 6 months ago

@sigmafelix @mitchellmanware No problem - it's always good to test more. For instance, we may consider doing a so-called "edge test" with time. This could mean checking that 2022-12-31 is valid and 2023-01-01 throws an error, i.e., testing at the very edge of some domain - in this case, time.
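
A sketch of what such an edge test could look like for calc_geos, assuming the test data directory only contains data through 2022-12-31 and that calc_geos errors when no matching files are found:

testthat::test_that("calc_geos errors beyond the 2018-2022 temporal domain", {
  testthat::expect_error(
    calc_geos(date_start = "2023-01-01",
              date_end = "2023-01-01",
              collection = "chm_tavg_1hr_g1440x721_v1",
              variable = "ACET",
              sites = readRDS("../testdata/sites_test.rds"),
              identifier = "site_id",
              fun = "mean",
              directory_with_data = "../testdata/chm/")
  )
})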

eva0marques commented 6 months ago

Thank you all for your help. I am currently writing the code to extract Durham, Wake, and Orange counties for 2 days of August 2022 from every downloaded file in the input/ folder. The objective is to provide real (but very light) data for the unit tests all along the pipeline. @mitchellmanware I will extract all variables of GEOS-CF with the help of your code. @all I know it requires some patience, but maybe you could give me a little more time before testing your calc_ functions. Otherwise, we are putting the cart before the horse (is that the right expression? ☺️ In French it is the plow before the cow!), and we run the risk of doing the same work twice.

sigmafelix commented 6 months ago

@eva0marques I am using the code I posted above to make a short vignette introducing how to generate small subsets from the raw data. If you are working on that documentation, I will pause mine.

mitchellmanware commented 6 months ago

@eva0marques

calc_narr and calc_geos can be found on the mm-calc-functions branch in the R/ folder.

eva0marques commented 6 months ago

@sigmafelix Yes, this is my task; I have been working on it for several days...

sigmafelix commented 6 months ago

@eva0marques Got it, I look forward to seeing the documentation soon!

kyle-messier commented 6 months ago
eva0marques commented 5 months ago

Work in progress:

kyle-messier commented 5 months ago
eva0marques commented 5 months ago

@Spatiotemporal-Exposures-and-Toxicology @mitchellmanware

I would like to create and save the test data for GEOS-CF chm directly in the local repo on /ddn, but I get a permission denied error. Can you grant me write permission for this folder (/Volumes/set/Projects/NRT-AP-Model/tests/testdata/)?

kyle-messier commented 5 months ago

@eva0marques Try now; if not, we can resolve it during the project meeting.

eva0marques commented 5 months ago

It works now, thanks!!! 😃

mitchellmanware commented 4 months ago

Test data for the NARR weasd and omega data sets have been created on the mm-copy-functions branch. I used the ncdf4 package to copy the raw metadata onto a subset of the data and rewrite the subset with all metadata and layer names. A messy version of the code is here (https://github.com/mitchellmanware/amadeus_workbench/blob/main/narr/create_NARR_test_data.R), but I will clean it up into a template or vignette.

eva0marques commented 4 months ago

@mitchellmanware Can you remind me what you told me at the last meeting about the problem with my test data function for NARR?

mitchellmanware commented 4 months ago

@eva0marques The problem is not with your function but with the way terra writes netCDF files. Your functions create perfect subsets within the R environment, but terra::writeCDF() does not write all of the metadata necessary to re-read the subsetted data with terra::rast(). The difference is visible when inspecting the contents of the .nc files with ncdf4::nc_open().

Raw weasd data:

> weasd_raw <- nc_open("nrtapmodel/covariates/data/narr/weasd/weasd.2018.nc")
> weasd_raw
File nrtapmodel/covariates/data/narr/weasd/weasd.2018.nc (NC_FORMAT_NETCDF4_CLASSIC):

     5 variables (excluding dimension variables):
        float lat[x,y]   (Contiguous storage)  
            axis: Y
            coordinate_defines: point
            long_name: Latitude
            standard_name: latitude
            units: degrees_north
        float lon[x,y]   (Contiguous storage)  
            axis: X
            coordinate_defines: point
            long_name: Longitude
            standard_name: longitude
            units: degrees_east
        int Lambert_Conformal[]   (Contiguous storage)  
            false_easting: 5632642.22547
            false_northing: 4612545.65137
            grid_mapping_name: lambert_conformal_conic
            latitude_of_projection_origin: 50
            longitude_of_central_meridian: -107
            standard_parallel: 50
             standard_parallel: 50
        float weasd[x,y,time]   (Chunking: [349,277,1])  (Compression: shuffle,level 1)
            GRIB_id: 65
            GRIB_name: WEASD
            _FillValue: 9.96920996838687e+36
            coordinates: lat lon
            grid_mapping: Lambert_Conformal
            level_desc: Surface
            standard_name: surface_snow_amount
            units: kg/m^2
            var_desc: accumulated snow
            missing_value: -9.96920996838687e+36
            dataset: NARR Daily Averages
            long_name: Daily Accumulated Snow at Surface
            parent_stat: Individual Obs
            statistic: Mean
            actual_range: 0
             actual_range: 746.8125
            valid_range: 0
             valid_range: 3000
        double time_bnds[nbnds,time]   (Chunking: [2,1])  

     4 dimensions:
        time  Size:365   *** is unlimited *** 
            axis: T
            long_name: Time
            standard_name: time
            units: hours since 1800-1-1 00:00:0.0
            delta_t: 0000-00-01 00:00:00
            avg_period: 0000-00-01 00:00:00
            actual_range: 1910952
             actual_range: 1919688
            coordinate_defines: start
        y  Size:277 
            long_name: northward distance from southwest corner of domain in projection coordinates
            standard_name: projection_y_coordinate
            units: m
        x  Size:349 
            long_name: eastward distance from southwest corner of domain in projection coordinates
            standard_name: projection_x_coordinate
            units: m
        nbnds  Size:2 (no dimvar)

    16 global attributes:
        Conventions: CF-1.2
        centerlat: 50
        centerlon: -107
        comments: 
        institution: National Centers for Environmental Prediction
        latcorners: 1.00000095367432
         latcorners: 0.897944986820221
         latcorners: 46.3544006347656
         latcorners: 46.6343307495117
        loncorners: -145.5
         loncorners: -68.3200531005859
         loncorners: -2.56989097595215
         loncorners: 148.641799926758
        platform: Model
        standardpar1: 50
        standardpar2: 50.000001
        title: Daily NARR
        history: created Sat Mar 26 05:32:28 MDT 2016 by NOAA/ESRL/PSD
        dataset_title: NCEP North American Regional Reanalysis (NARR)
        references: https://www.esrl.noaa.gov/psd/data/gridded/data.narr.html
        source: http://www.emc.ncep.noaa.gov/mmb/rreanl/index.html
        References: 

Cropped with terra and rewritten:

>  weasd_terra <- terra::rast("nrtapmodel/covariates/data/narr/weasd/weasd.2018.nc")
> nc <- terra::vect(system.file("shape/nc.shp", package = "sf"))
> co <- terra::project(nc[nc$NAME %in% c("Durham", "Wake", "Orange")],
+                      terra::crs(weasd_terra))
> weasd_terra_crop <- terra::crop(weasd_terra, co, snap = "out")
> terra::writeCDF(weasd_terra_crop, "~/Desktop/weasd.2018.nc")
> weasd_terra_reread <- nc_open("~/Desktop/weasd.2018.nc")
> weasd_terra_reread
File ~/Desktop/weasd.2018.nc (NC_FORMAT_NETCDF4):

     2 variables (excluding dimension variables):
        float weasd.2018[easting,northing,time]   (Contiguous storage)  
            _FillValue: -1.17549402418441e+38
            grid_mapping: crs
        int crs[]   (Contiguous storage)  
            crs_wkt: PROJCRS["unnamed",    BASEGEOGCRS["WGS 84",        DATUM["World Geodetic System 1984",            ELLIPSOID["WGS 84",6378137,298.257223563,                LENGTHUNIT["metre",1]]],        PRIMEM["Greenwich",0,            ANGLEUNIT["degree",0.0174532925199433]],        ID["EPSG",4326]],    CONVERSION["unnamed",        METHOD["Lambert Conic Conformal (2SP)",            ID["EPSG",9802]],        PARAMETER["Latitude of false origin",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8821]],        PARAMETER["Longitude of false origin",-107,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8822]],        PARAMETER["Latitude of 1st standard parallel",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8823]],        PARAMETER["Latitude of 2nd standard parallel",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8824]],        PARAMETER["Easting at false origin",5632642.22547,            LENGTHUNIT["metre",1],            ID["EPSG",8826]],        PARAMETER["Northing at false origin",4612545.65137,            LENGTHUNIT["metre",1],            ID["EPSG",8827]]],    CS[Cartesian,2],        AXIS["easting",east,            ORDER[1],            LENGTHUNIT["metre",1,                ID["EPSG",9001]]],        AXIS["northing",north,            ORDER[2],            LENGTHUNIT["metre",1,                ID["EPSG",9001]]]]
            spatial_ref: PROJCRS["unnamed",    BASEGEOGCRS["WGS 84",        DATUM["World Geodetic System 1984",            ELLIPSOID["WGS 84",6378137,298.257223563,                LENGTHUNIT["metre",1]]],        PRIMEM["Greenwich",0,            ANGLEUNIT["degree",0.0174532925199433]],        ID["EPSG",4326]],    CONVERSION["unnamed",        METHOD["Lambert Conic Conformal (2SP)",            ID["EPSG",9802]],        PARAMETER["Latitude of false origin",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8821]],        PARAMETER["Longitude of false origin",-107,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8822]],        PARAMETER["Latitude of 1st standard parallel",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8823]],        PARAMETER["Latitude of 2nd standard parallel",50,            ANGLEUNIT["degree",0.0174532925199433],            ID["EPSG",8824]],        PARAMETER["Easting at false origin",5632642.22547,            LENGTHUNIT["metre",1],            ID["EPSG",8826]],        PARAMETER["Northing at false origin",4612545.65137,            LENGTHUNIT["metre",1],            ID["EPSG",8827]]],    CS[Cartesian,2],        AXIS["easting",east,            ORDER[1],            LENGTHUNIT["metre",1,                ID["EPSG",9001]]],        AXIS["northing",north,            ORDER[2],            LENGTHUNIT["metre",1,                ID["EPSG",9001]]]]
            proj4: +proj=lcc +lat_0=50 +lon_0=-107 +lat_1=50 +lat_2=50 +x_0=5632642.22547 +y_0=4612545.65137 +datum=WGS84 +units=m +no_defs
            geotransform: 8131978.62068965472281 32462.98850574716925621 0 3554698.5 0 -32463

     3 dimensions:
        easting  Size:4 
            units: meter
            long_name: easting
        northing  Size:3 
            units: meter
            long_name: northing
        time  Size:365 
            units: seconds since 1970-1-1 00:00:00
            long_name: time
            calendar: standard

    3 global attributes:
        Conventions: CF-1.4
        created_by: R packages ncdf4 and terra (version 1.7-65)
        date: 2024-02-16 14:42:09

Using the ncdf4 package to recreate all metadata:

> weasd_ncdf4 <- nc_open("amadeus/tests/testdata/narr/weasd/weasd.2018.nc")
> weasd_ncdf4
File amadeus/tests/testdata/narr/weasd/weasd.2018.nc (NC_FORMAT_CLASSIC):

     4 variables (excluding dimension variables):
        float weasd[x,y,time]   
            units: kg/m^2
            _FillValue: 1.00000003318135e+32
            long_name: Daily Accumulated Snow at Surface
            grid_mapping: Lambert_Conformal
            coordinates: lat lon
        double lon[x,y]   
            units: degrees_east
            long_name: Longitude of cell center
        double lat[x,y]   
            units: degrees_north
            long_name: Latitude of cell center
        char Lambert_Conformal[]   
            units: 1
            name: lambert_conformal_conic
            long_name: lambert_conformal_conic
            grid_mapping_name: lambert_conformal_conic
            longitude_of_central_meridian: -107
            latitude_of_projection_origin: 50
            standard_parallel: 50
             standard_parallel: 50
            false_easting: 5632642.22547
            false_northing: 4612545.65137
            _CoordinateTransformType: Projection
            _CoordinateAxisTypes: GeoX GeoY

     3 dimensions:
        x  Size:4 
            units: m
            long_name: eastward distance from southwest corner of domain in projection coordinates
            axis: X
            standard_name: projection_x_coordinate
            _CoordinateAxisType: GeoX
        y  Size:3 
            units: m
            long_name: northward distance from southwest corner of domain in projection coordinates
            axis: Y
            standard_name: projection_y_coordinate
            _CoordinateAxisType: GeoY
        time  Size:365 
            units: hours since 1800-1-1 00:00:0.0
            long_name: time

    5 global attributes:
        title: test output of projected data
        institution: NOAA ESRL PSD
        source: weasd
        history: P.J. Bartlein, Fri Feb 16 11:43:37 2024
        Conventions: CF=1.6

You will notice in this last chunk that the metadata is retained while the dimensions match our desired subset.

eva0marques commented 4 months ago

I see 🤔 I'm sorry about that...

mitchellmanware commented 4 months ago

No need to be sorry! Maybe we can pursue writing a more general function to copy metadata from one netCDF to another, and then apply it to the create-test-data functions.
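
A rough sketch of one possible shape for that function using ncdf4 (an assumption, not the eventual implementation; attribute handling would need validation against the real files):

library(ncdf4)

# copy global and per-variable attributes from a source netCDF onto a
# destination netCDF that already contains variables with the same names
copy_nc_metadata <- function(source_path, destination_path) {
  src <- nc_open(source_path)
  dst <- nc_open(destination_path, write = TRUE)
  # global attributes (varid = 0)
  global_atts <- ncatt_get(src, varid = 0)
  for (att in names(global_atts)) {
    ncatt_put(dst, varid = 0, attname = att, attval = global_atts[[att]])
  }
  # attributes of variables present in both files
  shared_vars <- intersect(names(src$var), names(dst$var))
  for (v in shared_vars) {
    var_atts <- ncatt_get(src, varid = v)
    # _FillValue should be set when the variable is defined, so skip it here
    var_atts[["_FillValue"]] <- NULL
    for (att in names(var_atts)) {
      ncatt_put(dst, varid = v, attname = att, attval = var_atts[[att]])
    }
  }
  nc_close(src)
  nc_close(dst)
}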