Closed by sigmafelix 8 months ago
Some MODIS products are missing entire days for the tiles covering the study area. For example, MCD19A2 (aerosol optical depth) is entirely missing on January 1st, 2022. We will need to impute values for such missing days.
Our study area spans tiles h07v06 (upper left) through h13v03 (lower right). No images are present for January 1st, 2022.
MOD11A1: days 2022284-2022295 are missing.
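Where a day is missing entirely, one low-effort option (a sketch, assuming the daily layers are stacked into one multi-layer SpatRaster in date order) is cell-wise linear interpolation along the time axis with `terra::approximate()`:

```r
library(terra)

# Toy stack: five "days" of a 10 x 10 raster, with day 3 fully missing
daily <- rast(array(runif(500), dim = c(10, 10, 5)))
daily[[3]] <- setValues(daily[[3]], NA)

# approximate() interpolates NA cells along the layer (time) dimension
# between the nearest non-NA layers; rule = 2 also fills the end layers.
imputed <- terra::approximate(daily, rule = 2)
```

Whether linear interpolation is defensible for a variable like AOD is a separate question; this only shows the mechanics.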
VNP46A2 products are in HDF5 format without georeferencing. I will deal with this problem in a processing script I am working on.
MODIS download is approximately 50% done. Data are currently being downloaded to my DDN user directory; the entire dataset will be moved after the calculation is complete.
`download_modis_data()` was revised again to check the NASA archive for the folders present in each year and then retrieve only the available dates.
As discussed before, MOD06_L2 (cloud cover) is organized by acquisition time, which makes it difficult to select scenes covering the relevant time frames and spatial extents. We need two elements from the images, day and night cloud cover, but it appears that MOD06_L2 has no such entries (reference). The easiest way is to download all 5-minute products for five years, which takes a considerably long time and a large amount of storage (10+ GB/day; approximately 18 TB of DDN storage). I think @mitchellmanware could explain exactly which of the dozens of layers in each scene we want to use. The covariate table on Teams has no references for these two entries.
@sigmafelix
Cloud coverage from MODIS satellites, as well as surface reflectance and surface temperature (day and night), was included as covariates in An Ensemble Learning Approach for Estimating High Spatiotemporal Resolution of Ground-Level Ozone in the Contiguous United States (Requia et al., 2020). The full list of specific variables used in this model is in the paper's Supporting Information. I will look into the specific layers used from each satellite, and into whether there is an alternative MODIS product for cloud coverage.
Also, thank you for bringing the "Reference Articles" column to my attention. I will spend time today and tomorrow ensuring that each article is properly referenced for each covariate in the Excel file, and I will also create a more formal list of citations.
@mitchellmanware I will look at the SI to identify relevant layers. Thank you.
For the record, MODIS images are tiled; leveraging virtual rasters via `terra::vrt()` helps a lot.
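For instance, a minimal sketch (the directory and file pattern are hypothetical):

```r
library(terra)
# Collect one day's tiles (hypothetical paths), then build a virtual
# mosaic; vrt() references the tiles instead of merging them on disk.
tiles <- list.files("input/modis/raw/MOD11A1/2022001",
                    pattern = "\\.tif$", full.names = TRUE)
v <- terra::vrt(tiles)
```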
Aerosol products (MCD19A2) include multiple orbit overpasses per subdataset, which results in multiple layers in each subdataset (LPDAAC reference). For example, aerosol optical depth at the 0.47-micrometer wavelength typically has four layers per day. Could I take the mean of these layers to represent each day? @Spatiotemporal-Exposures-and-Toxicology
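If a simple daily mean turns out to be acceptable, a sketch (the file name is hypothetical; `Optical_Depth_047` is the LP DAAC subdataset name for 0.47-micrometer AOD):

```r
library(terra)
# Each orbit overpass is a separate layer in the subdataset
aod <- rast("MCD19A2.A2022002.h11v05.061.hdf", subds = "Optical_Depth_047")
# Cell-wise mean across overpasses, ignoring overpasses with no retrieval
aod_daily <- app(aod, fun = mean, na.rm = TRUE)
```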
Spatiotemporal covariates at site locations and in 1-, 10-, and 50-km circular buffers were completed for the MOD11, MOD13, and MCD19* variables. I am currently using the "Short Name" in the covariate table with minor changes: a zero-padded, five-digit suffix encoding the buffer radius in meters (e.g., "MOD_NDVIV_0_00000" is the value at sites, "MOD_NDVIV_0_10000" is the mean value in a 10-km (10,000 m) circular buffer, etc.).
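The buffer suffixes follow directly from `sprintf()` zero-padding:

```r
radii <- c(0, 1000, 10000, 50000)  # buffer radii in meters
sprintf("MOD_NDVIV_0_%05d", radii)
# "MOD_NDVIV_0_00000" "MOD_NDVIV_0_01000" "MOD_NDVIV_0_10000" "MOD_NDVIV_0_50000"
```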
VNP46A2 HDF files are not georeferenced; however, we can derive the exact extent in MODIS sinusoidal CRS from other data. I will work on this problem next week.
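A sketch of how the extent could be derived from the tile indices, assuming the standard MODIS sinusoidal tile grid (grid constants from the MODIS grid definition; parsing h/v out of the file name is omitted):

```r
library(terra)
tile_size <- 1111950.5196666666  # tile width/height in meters (10 degrees)
x0 <- -20015109.354              # western edge of the sinusoidal grid
y0 <- 10007554.677               # northern edge of the sinusoidal grid

h <- 7; v <- 6                   # e.g., tile h07v06
e <- ext(x0 + h * tile_size, x0 + (h + 1) * tile_size,
         y0 - (v + 1) * tile_size, y0 - v * tile_size)

# Then, for an un-georeferenced raster r read from the HDF5 file:
# ext(r) <- e
# crs(r) <- "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs"
```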
MOD09GA calculation will be run over the weekend.
MOD06_L2 downloads are driven by a CSV file from NASA: using a formatted URL, users can retrieve a list of links to the HDF files. `download_modis_data()` was revised accordingly.
MOD06_L2 download for five years is in progress.
VNP46A2 processing is expected to be completed around Tuesday (12/12/2023).
Per the VNP46A2 coordinate assignment, the night light scene on 01/01/2018 is shown below (65535 was replaced with `NA`):
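For the record, the fill-value replacement is a one-liner in `terra`:

```r
library(terra)
r <- rast(matrix(c(12, 65535, 7, 65535), nrow = 2))  # toy raster
r[r == 65535] <- NA   # 65535 is the VNP46A2 fill value
```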
MOD06_L2 data are on a curvilinear grid, which means the data must be rectified with `terra::rectify()` before any further analysis. However, running `rectify()` crashes the R session with a memory segfault, and the `terra` GitHub issue page does not mention anything like this. When triton/matern (Linux) and macOS were tested, the former ended with the same segfault while the latter did not read the HDF file properly. `stars::read_stars()` read the data as-is.

I will try reading MOD06_L2 data with `terra` using an Apptainer image on the HPC, then try stars-based solutions.
The Apptainer trial reproduced the memory segfault. A stars mosaicking trial was not successful either, with errors:
> files_mod06 <-
+   list.files(path = "./input/modis/raw/MOD06_L2",
+              pattern = "*.hdf$",
+              full.names = TRUE,
+              recursive = TRUE)
> v18001 <- grep("A2018001", files_mod06, value = TRUE)
> header <- "HDF4_EOS:EOS_SWATH:"
> suffix <- ":mod06:"
> parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
> filename <- v18001[1:10]
> parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])
> nt1 <- stars::read_stars(parsinga[1])
Warning message:
ignoring unrecognized unit: none
> nt2 <- stars::read_stars(parsinga[2])
Warning message:
ignoring unrecognized unit: none
> lightmosaicked <-
+   stars::st_mosaic(nt1, nt2)
trying to read file: /tmp/Rtmpjlk49N/file2f1b361e04edf9.tif
Error: file not found
In addition: Warning messages:
1: In CPL_get_metadata(file, NA_character_, options) :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648
2: In CPL_read_gdal(as.character(x), as.character(options), as.character(driver), :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648
I will try Python code instead, though that may cause problems for testing and package deployment.
@sigmafelix @mitchellmanware Catching up a bit here. With the curvilinear grid, are we not able to just extract the value from MODIS at the point location? If so, we can simplify this covariate and use only exact grid-cell extraction, avoiding the more complicated buffers where CRS and grid structure matter more.
@sigmafelix @mitchellmanware @dzilber @dawranadeep @eva0marques
I was just made aware of the NOAA database for all their data:
https://www.noaa.gov/nodd/datasets
It has a light-at-night product, which is perhaps easier to use than the MODIS one, although I did not check its temporal resolution. Temporal resolution may not even be a requirement for that covariate.
Additionally, to everyone, we may find additional datasets to include as covariates, but let's be judicious about including them. Otherwise, we may never stop calculating covariates!
@Spatiotemporal-Exposures-and-Toxicology Thank you for sharing the list. It is really helpful. Night light products are available for 2018-2022 (and onward) from VIIRS (Colorado School of Mines), which mosaics all scenes around the globe monthly. If we are okay with the monthly temporal resolution, I will work with these data (and a download function as well). I think the median or average layer would work for covariate calculation.
The Colorado School of Mines VIIRS mosaic looks okay in its annual product, but its monthly products look very noisy.
The dev version of `terra` now fixes the `rectify()` segfault. It worked fine on my own Linux laptop, but we have to wait for OSC to figure out the missing system dependency on the HPC for building the dev version of `terra`. Meanwhile, I will use the Apptainer image instead.
@Spatiotemporal-Exposures-and-Toxicology
All variables except the day/night cloud fractions were calculated at sites.
The reference for the cloud fractions is an ozone modeling paper, and its variable calculation is only vaguely described. One day of MOD06_L2 images is mosaicked and displayed below; it shows that the data are spatially discontinuous and temporally ephemeral (these images are 5-minute swaths).
Alternative: mosaicking monthly/annual images to compute temporally aggregated cloud fractions, but this makes little sense as clouds are short-lived.
Question: would these variables, temporally aggregated or not, meaningfully contribute to model accuracy?
Raw MODIS data except for MOD06_L2 were copied into the team repository in DDN.
Using the Python script linked in a comment above, I could get the expected results from the swath data:
It is clear that `terra` functions for curvilinear grids need improvement, which is beyond our scope. Perhaps we should include this Python script in ./tools and call it from the pipeline using the `reticulate` package.
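A sketch of the `reticulate` wiring, with a hypothetical script path and function name:

```r
library(reticulate)
# Hypothetical: the script in ./tools exposes process_swath(in_path, out_path),
# which rectifies one MOD06_L2 swath and writes a GeoTIFF.
reticulate::source_python("./tools/process_mod06_swath.py")
process_swath("input/modis/raw/61/MOD06_L2/MOD06_L2.A2018001.1745.hdf",
              "output/MOD06_L2.A2018001.1745.tif")
```

This keeps the Python dependency contained in one script, though it does add `reticulate` and a Python environment to the package's test and deployment requirements.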
I think I figured it out with a combination of `stars` and `terra`. I will continue this way to calculate cloud coverage.
# Paths on DDN (user-specific)
ddnhome <- "/ddn/gs1/home/songi2"
usrlib <- file.path(ddnhome, "r-libs")
prjhome <- file.path(ddnhome, "projects", "NRTAPModel")
inputdir <- file.path(prjhome, "input")
.libPaths(usrlib)

# Install the package from the working branch if it is not available
if (!require(NRTAPmodel)) {
  pak::pak("Spatiotemporal-Exposures-and-Toxicology/NRTAPModel@isong-calc-covars")
  library(NRTAPmodel)
}

pkgs <- c("terra", "sf", "foreach", "parallelly", "doParallel",
          "data.table", "doRNG", "exactextractr")
invisible(sapply(pkgs, library, character.only = TRUE, quietly = TRUE))
sf::sf_use_s2(FALSE)
setwd(prjhome)

source(file.path(prjhome, "R", "calculate_covariates.R"))
source(file.path(inputdir, "Rinput", "processing_functions", "filter_unique_sites.R"))

# Unique monitoring sites as an sf object
sites <- filter_unique_sites()
sites_sf <-
  sf::st_as_sf(as.data.frame(sites), coords = c("lon", "lat"),
               crs = "EPSG:4326")

# List MOD06_L2 swath files and build GDAL subdataset names
mod06_dir <- file.path(inputdir, "modis", "raw", "61", "MOD06_L2")
files_mod06 <-
  list.files(path = mod06_dir,
             pattern = "*.hdf$",
             full.names = TRUE,
             recursive = TRUE)
header <- "HDF4_EOS:EOS_SWATH:"
suffix <- ":mod06:"
parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
filename <- files_mod06[1:24]
parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])

# Warp one curvilinear swath to a regular WGS84 grid via stars
rectify_ref_stars <- function(ras) {
  ras <- stars::read_stars(ras)
  rtd <- stars::st_warp(ras, crs = 4326, cellsize = 0.025)
  return(rtd)
}

# Rectify each swath, convert to SpatRaster, and mosaic
pps <- split(parsinga, parsinga) |>
  lapply(rectify_ref_stars) |>
  lapply(terra::rast) |>
  Reduce(f = terra::mosaic, x = _)
plot(pps)
@Spatiotemporal-Exposures-and-Toxicology
MOD06_L2 processing is complete. The data were copied to the project directory in DDN (RDS file format). Due to the characteristics of curvilinear grids (as displayed above), there are missing values in the extracted data:
── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 ID 0 1 530. 305. 1 265 530. 794 1058 ▇▇▇▇▇
2 MOD_CLCVD_0_00000 17899 0.991 0.580 0.439 0 0.0200 0.780 1.00 1.00 ▅▁▁▁▇
3 MOD_CLCVN_0_00000 71103 0.963 0.627 0.429 0 0.0800 0.920 1.00 1.00 ▅▁▁▁▇
4 MOD_CLCVD_0_01000 17854 0.991 0.580 0.437 0 0.0334 0.771 1 1 ▆▁▁▁▇
5 MOD_CLCVN_0_01000 71067 0.963 0.627 0.427 0 0.103 0.920 1 1 ▅▁▁▁▇
6 MOD_CLCVD_0_10000 17526 0.991 0.578 0.423 0 0.0620 0.730 1.00 1 ▆▁▁▁▇
7 MOD_CLCVN_0_10000 70809 0.963 0.621 0.416 0 0.131 0.859 1.00 1 ▅▁▁▁▇
8 MOD_CLCVD_0_50000 16505 0.991 0.573 0.395 0 0.142 0.662 0.989 1 ▅▂▂▂▇
9 MOD_CLCVN_0_50000 69961 0.964 0.608 0.390 0 0.187 0.750 0.996 1 ▅▂▂▂▇
Missing rates are less than 4 percent. How to treat these missing values is open for discussion.
MODIS downloading/processing is in progress. Approximately 6 TB of DDN storage will be used.
Remarks
Since the satellite revisits the same area around the same time, temporal filtering will be required in the download code.
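A sketch of that filter, using the HHMM acquisition time embedded in MOD06_L2 file names (the time window is a placeholder):

```r
# MOD06_L2 file names embed the swath start time as HHMM UTC, e.g.
# "MOD06_L2.A2018001.1745.061.2018003204149.hdf"
files <- c("MOD06_L2.A2018001.0550.061.x.hdf",
           "MOD06_L2.A2018001.1745.061.x.hdf")
hhmm <- as.integer(sub(".*\\.A\\d{7}\\.(\\d{4})\\..*", "\\1", files))
keep <- files[hhmm >= 1500 & hhmm <= 2100]  # placeholder daytime window
```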