NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Download and process MODIS data #197

Closed sigmafelix closed 8 months ago

sigmafelix commented 9 months ago

MODIS downloading/processing is in progress. Approximately 6 TB of DDN storage will be used.

Remarks

sigmafelix commented 9 months ago

Some MODIS products have missing days in the tiles covering the study area. For example, MCD19A2 (Aerosol Optical Depth) is entirely missing on January 1st, 2022. We will need to impute values for such missing days.

image

Our study area spans tiles h07v06 (upper left) through h13v03 (lower right). No images are present for January 1st, 2022.

MOD11A1: days 2022284-2022295 are missing.
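One option for the imputation mentioned above, sketched under the assumption that the daily files are stacked in date order as one layer per day (`daily_files` is a hypothetical input): per-cell linear interpolation along the time dimension with `terra::approximate()`.

```r
# Hedged sketch: fill all-NA days by interpolating each cell across time.
# `daily_files` (one file per day, in date order) is a hypothetical input.
library(terra)

stack <- terra::rast(daily_files)             # one layer per day
filled <- terra::approximate(stack, rule = 2) # linear interpolation over
                                              # layers; rule = 2 extends
                                              # the ends of the series
```

Whether linear interpolation is appropriate for multi-day gaps (e.g., MOD11A1 days 2022284-2022295) is a separate question for discussion.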

VNP46A2 products are distributed in HDF5 format without georeferencing. I will address this problem in a processing script I am working on.


MODIS download is approximately 50% complete. Data are currently downloaded to my DDN user directory; the entire dataset will be moved after the calculation is complete.

sigmafelix commented 9 months ago

download_modis_data has been revised again to check the NASA archive folders for each year and retrieve only the available dates. As discussed before, MOD06_L2 (cloud cover) is organized by acquisition time, which makes it difficult to select scenes that match both the relevant time frames and spatial extents. We need two elements--day and night cloud cover--from the images, but MOD06_L2 appears to have no such entries (reference). The easiest option is to download all 5-minute products for five years, which takes a considerably long time and a large amount of storage (10+ GB/day; approximately 18 TB in DDN storage). I think @mitchellmanware could explain exactly which of the dozens of layers in each scene we want to use. The covariate table on Teams has no references for these two entries.

image
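The per-date availability check could be done by probing the archive's per-day folders and keeping only the days that respond. A hedged sketch; the URL layout mirrors the LAADS archive but is an assumption, so adjust to the actual endpoint:

```r
# Hedged sketch: keep only day-of-year folders that exist in the archive.
# Base URL layout is an assumption modeled on the LAADS archive.
library(httr)

base <- "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MOD11A1"
year <- 2022
days <- sprintf("%03d", seq_len(365))

available <- days[vapply(
  days,
  function(d) {
    res <- httr::HEAD(sprintf("%s/%d/%s/", base, year, d))
    httr::status_code(res) == 200
  },
  logical(1)
)]
```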
mitchellmanware commented 9 months ago

@sigmafelix

Cloud coverage from MODIS satellites, as well as surface reflectance and surface temperature (day and night), were included as covariates in An Ensemble Learning Approach for Estimating High Spatiotemporal Resolution of Ground-Level Ozone in the Contiguous United States (Requia et al., 2020). The full list of specific variables used in this model can be seen in the paper's Supporting Information. I will look into the specific layers used from each satellite, as well as whether there is an alternative MODIS product for cloud coverage.

Also, thank you for bringing the "Reference Articles" column to my attention. I will spend time today and tomorrow to ensure that each article is properly referenced for each covariate in the Excel file, and will also create a more formal list of citations.

sigmafelix commented 9 months ago

@mitchellmanware I will look at the SI to identify relevant layers. Thank you.

sigmafelix commented 9 months ago

For the record, MODIS images are tiled; leveraging virtual rasters via terra::vrt helps a lot.
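A minimal sketch of the virtual-raster approach, with hypothetical paths: all tiles for one day behave as a single mosaic without copying any pixel data.

```r
# Sketch: combine per-tile files for one day into a single virtual raster.
# Directory layout and file extension are hypothetical.
library(terra)

tiles <- list.files("input/modis/raw/MOD13A2/2022/001",
                    pattern = "\\.tif$", full.names = TRUE)
vr <- terra::vrt(tiles)  # lightweight; behaves like one large SpatRaster
```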

sigmafelix commented 9 months ago

Aerosol products (MCD19A2) have multiple orbit overpasses per subdataset, which result in multiple layers in each subdataset (LPDAAC reference). For example, aerosol optical depth at the 0.47 micrometer wavelength typically has four layers per day. Could I take the mean of these layers to represent each day? @Spatiotemporal-Exposures-and-Toxicology
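If the daily mean is acceptable, a sketch of the collapse would look like this; the file name and subdataset name are assumptions for illustration:

```r
# Hedged sketch: average the per-overpass layers of one MCD19A2 subdataset
# into a single daily value. File and subdataset names are assumptions.
library(terra)

aod <- terra::rast("MCD19A2.A2022002.h11v05.061.hdf",
                   subds = "Optical_Depth_047")  # one layer per overpass
aod_daily <- terra::app(aod, fun = "mean", na.rm = TRUE)
```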

sigmafelix commented 9 months ago

12/01/2023 update

Spatiotemporal covariates at site locations and within 1-, 10-, and 50-km circular buffers were completed for the MOD11, MOD13, and MCD19* variables. I am currently using the "Short Name" in the covariate table with a minor change: a zero-padded, five-digit suffix encoding the buffer radius in meters (e.g., "MOD_NDVIV_0_00000" is the value at sites, "MOD_NDVIV_0_10000" is the mean within a 10-km (10,000 m) circular buffer, etc.).
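The naming rule above can be expressed in one line; the helper name is hypothetical:

```r
# Sketch of the naming convention: short name plus a zero-padded,
# five-digit buffer radius in meters. Helper name is hypothetical.
make_covar_name <- function(short_name, radius_m) {
  sprintf("%s_%05d", short_name, radius_m)
}

make_covar_name("MOD_NDVIV_0", 0)      # "MOD_NDVIV_0_00000"
make_covar_name("MOD_NDVIV_0", 10000)  # "MOD_NDVIV_0_10000"
```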

VNP46A2 HDF files are not georeferenced; however, we can derive the exact extent in the MODIS sinusoidal CRS from other data. I will work on this next week.
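One way to attach georeferencing, sketched under the assumption that the tile layout follows the standard MODIS sinusoidal grid (36 x 18 tiles) as noted above; the file name is hypothetical:

```r
# Hedged sketch: compute a tile extent from the hXXvYY indices and assign
# it, plus the sinusoidal CRS, to an unreferenced raster. Assumes the
# standard MODIS sinusoidal grid; the file name is hypothetical.
library(terra)

tile_extent <- function(h, v, size = 1111950.5196666666) {
  xmin <- -20015109.354 + h * size   # global grid origin (west edge)
  ymax <-  10007554.677 - v * size   # v counts down from the north edge
  terra::ext(xmin, xmin + size, ymax - size, ymax)
}

f  <- "VNP46A2.A2022001.h08v05.001.h5"
hv <- regmatches(f, regexpr("h[0-9]{2}v[0-9]{2}", f))
h  <- as.integer(substr(hv, 2, 3))
v  <- as.integer(substr(hv, 5, 6))

r <- terra::rast(f)  # read without georeferencing
terra::ext(r) <- tile_extent(h, v)
terra::crs(r) <- "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs"
```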

MOD09GA calculation will be run over the weekend.

sigmafelix commented 9 months ago

12/08/2023 update

sigmafelix commented 9 months ago

12/11/2023 update

sigmafelix commented 8 months ago

MOD06_L2 data are on a curvilinear grid, which means the data should be rectified with terra::rectify() before any further analyses. However, running rectify() crashes the R session with a memory segfault. The terra GitHub issue tracker does not mention this problem. When triton/matern (Linux) and macOS were tested, the former ended with the same segfault, while the latter did not read the HDF file properly. stars::read_stars() reads the data as-is. I will try reading MOD06_L2 data with terra using an Apptainer image on the HPC, then try stars-based solutions.

sigmafelix commented 8 months ago

The Apptainer trial reproduced the memory segfault. A stars mosaicking trial was also unsuccessful, with these errors:

> files_mod06 <-
+     list.files(path = "./input/modis/raw/MOD06_L2",
+                pattern = "*.hdf$",
+                full.names = TRUE,
+                recursive = TRUE)
> v18001 <- grep("A2018001", files_mod06, value = T)
> header <- "HDF4_EOS:EOS_SWATH:"
> suffix <- ":mod06:"
> parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
> filename <- v18001[1:10]
> parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])
> nt1 <- stars::read_stars(parsinga[1])
Warning message:
ignoring unrecognized unit: none 
> nt2 <- stars::read_stars(parsinga[2])
Warning message:
ignoring unrecognized unit: none 
> lightmosaicked <-
+ stars::st_mosaic(nt1, nt2)
trying to read file: /tmp/Rtmpjlk49N/file2f1b361e04edf9.tif
Error: file not found
In addition: Warning messages:
1: In CPL_get_metadata(file, NA_character_, options) :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648
2: In CPL_read_gdal(as.character(x), as.character(options), as.character(driver),  :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648

I will try Python code instead, though that may create problems for testing and package deployment.

kyle-messier commented 8 months ago

@sigmafelix @mitchellmanware Catching up a bit here. With the curvilinear grid, are we not able to just extract the value from MODIS to the point location? If we can do that, we can simplify this covariate and only use the exact grid cell extraction and not more complicated buffers where CRS and grid structure are more impactful.
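If buffers are dropped, point extraction might be possible without rectification by converting the curvilinear cells to points and joining each site to its nearest cell. A hedged sketch; the subdataset string and site coordinates are hypothetical, and whether this is robust for MOD06_L2 swaths would need verification:

```r
# Hedged sketch: nearest-cell extraction from a curvilinear swath without
# rectification. Subdataset string and site location are hypothetical.
library(stars)
library(sf)

sds <- "HDF4_EOS:EOS_SWATH:MOD06_L2.A2018001.1640.061.hdf:mod06:Cloud_Fraction_Day"
cf  <- stars::read_stars(sds)

# curvilinear cells as points, then nearest-feature join to the sites
cf_pts <- sf::st_as_sf(cf, as_points = TRUE)
sites  <- sf::st_as_sf(data.frame(lon = -78.9, lat = 35.9),
                       coords = c("lon", "lat"),
                       crs = sf::st_crs(cf_pts))
vals <- sf::st_join(sites, cf_pts, join = sf::st_nearest_feature)
```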

kyle-messier commented 8 months ago

@sigmafelix @mitchellmanware @dzilber @dawranadeep @eva0marques

I was just made aware of the NOAA database of all their datasets:

https://www.noaa.gov/nodd/datasets

It does have a light-at-night product, perhaps easier to use than the MODIS one, although I didn't check its temporal resolution. High temporal resolution may not be necessary for that covariate anyway.

Additionally, to everyone, we may find additional datasets to include as covariates, but let's be judicious about including them. Otherwise, we may never stop calculating covariates!

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology Thank you for sharing the list. It is really helpful. Night light products are available for 2018-2022 (and onward) from VIIRS (Colorado School of Mines), which mosaics all scenes around the globe monthly. If we are okay with monthly temporal resolution, I will work with these data (and write a download function as well). I think the median or average layer would work for covariate calculation.

sigmafelix commented 8 months ago

The Colorado School of Mines VIIRS mosaic looks fine in its annual product, but the monthly products look very noisy.

The dev version of terra has fixed the rectify() segfault. It worked on my own Linux laptop, but we have to wait for OSC to resolve a missing system dependency on the HPC before the dev version of terra can be built there. Meanwhile, I will use the Apptainer image instead.

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology

12/29/2023

sigmafelix commented 8 months ago

Raw MODIS data, except for MOD06_L2, were copied into the team repository in DDN.

sigmafelix commented 8 months ago

Using the Python script linked in a comment above, I could get the expected results from the swath data: image

It is clear that terra's handling of curvilinear grids needs improvement, which is beyond our scope. Perhaps we should include this Python script in ./tools and call it from the pipeline via the reticulate package.
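Wiring a repository Python script into the R pipeline could look like the following sketch; the script path and the function it exposes are hypothetical:

```r
# Sketch: call a Python swath-processing script from R via reticulate.
# Script path and function name are hypothetical.
library(reticulate)

reticulate::source_python("tools/mod06_swath.py")
# source_python() exposes the script's functions in the R session,
# e.g., a hypothetical `regrid_swath()`:
cf_day <- regrid_swath("MOD06_L2.A2018001.1640.061.hdf",
                       variable = "Cloud_Fraction_Day")
```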

sigmafelix commented 8 months ago

I think I figured it out with a combination of stars and terra. I will continue this way to calculate cloud coverage.

Code tried

# User library and project paths on DDN
ddnhome <- "/ddn/gs1/home/songi2"
usrlib <- file.path(ddnhome, "r-libs")
prjhome <- file.path(ddnhome, "projects", "NRTAPModel")
inputdir <- file.path(prjhome, "input")
.libPaths(usrlib)

# Install the working branch if the package is not available
if (!require(NRTAPmodel)) {
  pak::pak("Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel@isong-calc-covars")
  library(NRTAPmodel)
}

pkgs <- c("terra", "sf", "foreach", "parallelly", "doParallel",
          "data.table", "doRNG", "exactextractr")
invisible(sapply(pkgs, library, character.only = TRUE, quietly = TRUE))
sf::sf_use_s2(FALSE)
setwd(prjhome)

# Covariate functions and unique site locations
source(file.path(prjhome, "R", "calculate_covariates.R"))
source(file.path(inputdir, "Rinput", "processing_functions", "filter_unique_sites.R"))
sites <- filter_unique_sites()
sites_sf <-
  sf::st_as_sf(as.data.frame(sites), coords = c("lon", "lat"),
               crs = "EPSG:4326")

# MOD06_L2 swath files and GDAL subdataset strings
mod06_dir <- file.path(inputdir, "modis", "raw", "61", "MOD06_L2")
files_mod06 <-
  list.files(path = mod06_dir,
             pattern = "\\.hdf$",
             full.names = TRUE,
             recursive = TRUE)

header <- "HDF4_EOS:EOS_SWATH:"
suffix <- ":mod06:"
parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
filename <- files_mod06[1:24]
parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])

# Read one swath with stars and warp the curvilinear grid to WGS84
rectify_ref_stars <- function(ras) {
  ras <- stars::read_stars(ras)
  rtd <- stars::st_warp(ras, crs = 4326, cellsize = 0.025)
  return(rtd)
}

# Warp each swath, convert to SpatRaster, and mosaic into one layer
pps <- split(parsinga, parsinga) |>
  lapply(rectify_ref_stars) |>
  lapply(terra::rast) |>
  Reduce(f = terra::mosaic, x = _)
plot(pps)

Result

image

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology

MOD06_L2 processing is complete. The data were copied to the project directory in DDN (RDS file format). Due to the characteristics of curvilinear grids (as displayed above), there are missing values in the extracted data:

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate    mean      sd p0      p25     p50     p75    p100 hist 
1 ID                        0         1     530.    305.     1 265      530.    794     1058    ▇▇▇▇▇
2 MOD_CLCVD_0_00000     17899         0.991   0.580   0.439  0   0.0200   0.780   1.00     1.00 ▅▁▁▁▇
3 MOD_CLCVN_0_00000     71103         0.963   0.627   0.429  0   0.0800   0.920   1.00     1.00 ▅▁▁▁▇
4 MOD_CLCVD_0_01000     17854         0.991   0.580   0.437  0   0.0334   0.771   1        1    ▆▁▁▁▇
5 MOD_CLCVN_0_01000     71067         0.963   0.627   0.427  0   0.103    0.920   1        1    ▅▁▁▁▇
6 MOD_CLCVD_0_10000     17526         0.991   0.578   0.423  0   0.0620   0.730   1.00     1    ▆▁▁▁▇
7 MOD_CLCVN_0_10000     70809         0.963   0.621   0.416  0   0.131    0.859   1.00     1    ▅▁▁▁▇
8 MOD_CLCVD_0_50000     16505         0.991   0.573   0.395  0   0.142    0.662   0.989    1    ▅▂▂▂▇
9 MOD_CLCVN_0_50000     69961         0.964   0.608   0.390  0   0.187    0.750   0.996    1    ▅▂▂▂▇

Missing rates are less than 4 percent. How to treat these missing values is open for discussion.