NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Download and process MODIS data #197

Closed sigmafelix closed 8 months ago

sigmafelix commented 9 months ago

MODIS downloading/processing is in progress. Approximately 6 TB of DDN storage will be used.

Remarks

sigmafelix commented 9 months ago

Some MODIS products have missing days in the tiles covering the study area. For example, MCD19A2 (Aerosol Optical Depth) is entirely missing on January 1st, 2022. We will need to impute values for such missing days.

image

Our study area spans tiles h07v06 (upper left) through h13v03 (lower right). No images are present for January 1st, 2022.

MOD11A1: days 2022284-2022295 are missing.
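One option for the imputation mentioned above, sketched under the assumption that the daily files are stacked in date order as one layer per day (`daily_files` is a hypothetical input): per-cell linear interpolation along the time dimension with `terra::approximate()`.

```r
# Hedged sketch: fill all-NA days by interpolating each cell across time.
# `daily_files` (one file per day, in date order) is a hypothetical input.
library(terra)

stack <- terra::rast(daily_files)             # one layer per day
filled <- terra::approximate(stack, rule = 2) # linear interpolation over
                                              # layers; rule = 2 extends
                                              # the ends of the series
```

Whether linear interpolation is appropriate for multi-day gaps (e.g., MOD11A1 days 2022284-2022295) is a separate question for discussion.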

VNP46A2 products are distributed in HDF5 format without georeferencing. I will address this problem in a processing script I am working on.


MODIS download is approximately 50% complete. Data are currently downloaded to my DDN user directory; the entire dataset will be moved after the calculation is complete.

sigmafelix commented 9 months ago

download_modis_data has been revised again to check the NASA archive folders for each year and retrieve only the available dates. As discussed before, MOD06_L2 (cloud cover) is organized by acquisition time, which makes it difficult to select scenes that match both the relevant time frames and spatial extents. We need two elements--day and night cloud cover--from the images, but MOD06_L2 appears to have no such entries (reference). The easiest option is to download all 5-minute products for five years, which takes a considerably long time and a large amount of storage (10+ GB/day; approximately 18 TB in DDN storage). I think @mitchellmanware could explain exactly which of the dozens of layers in each scene we want to use. The covariate table on Teams has no references for these two entries.

image
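The per-date availability check could be done by probing the archive's per-day folders and keeping only the days that respond. A hedged sketch; the URL layout mirrors the LAADS archive but is an assumption, so adjust to the actual endpoint:

```r
# Hedged sketch: keep only day-of-year folders that exist in the archive.
# Base URL layout is an assumption modeled on the LAADS archive.
library(httr)

base <- "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MOD11A1"
year <- 2022
days <- sprintf("%03d", seq_len(365))

available <- days[vapply(
  days,
  function(d) {
    res <- httr::HEAD(sprintf("%s/%d/%s/", base, year, d))
    httr::status_code(res) == 200
  },
  logical(1)
)]
```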
mitchellmanware commented 9 months ago

@sigmafelix

Cloud coverage from MODIS satellites, as well as surface reflectance and surface temperature (day and night), were included as covariates in An Ensemble Learning Approach for Estimating High Spatiotemporal Resolution of Ground-Level Ozone in the Contiguous United States (Requia et al., 2020). The full list of specific variables used in this model can be seen in the paper's Supporting Information. I will look into the specific layers used from each satellite, as well as whether there is an alternative MODIS product for cloud coverage.

Also, thank you for bringing the "Reference Articles" column to my attention. I will spend time today and tomorrow to ensure that each article is properly referenced for each covariate in the Excel file, and will also create a more formal list of citations.

sigmafelix commented 9 months ago

@mitchellmanware I will look at the SI to identify relevant layers. Thank you.

sigmafelix commented 9 months ago

For the record, MODIS images are tiled; leveraging virtual rasters via terra::vrt helps a lot.
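A minimal sketch of the virtual-raster approach, with hypothetical paths: all tiles for one day behave as a single mosaic without copying any pixel data.

```r
# Sketch: combine per-tile files for one day into a single virtual raster.
# Directory layout and file extension are hypothetical.
library(terra)

tiles <- list.files("input/modis/raw/MOD13A2/2022/001",
                    pattern = "\\.tif$", full.names = TRUE)
vr <- terra::vrt(tiles)  # lightweight; behaves like one large SpatRaster
```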

sigmafelix commented 9 months ago

Aerosol products (MCD19A2) have multiple orbit overpasses per subdataset, which result in multiple layers in each subdataset (LPDAAC reference). For example, aerosol optical depth at the 0.47 micrometer wavelength typically has four layers per day. Could I take the mean of these layers to represent each day? @Spatiotemporal-Exposures-and-Toxicology
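If the daily mean is acceptable, a sketch of the collapse would look like this; the file name and subdataset name are assumptions for illustration:

```r
# Hedged sketch: average the per-overpass layers of one MCD19A2 subdataset
# into a single daily value. File and subdataset names are assumptions.
library(terra)

aod <- terra::rast("MCD19A2.A2022002.h11v05.061.hdf",
                   subds = "Optical_Depth_047")  # one layer per overpass
aod_daily <- terra::app(aod, fun = "mean", na.rm = TRUE)
```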

sigmafelix commented 9 months ago

12/01/2023 update

Spatiotemporal covariates at site locations and within 1-, 10-, and 50-km circular buffers were completed for the MOD11, MOD13, and MCD19* variables. I am currently using the "Short Name" in the covariate table with a minor change: a zero-padded, five-digit suffix encoding the buffer radius in meters (e.g., "MOD_NDVIV_0_00000" is the value at sites, "MOD_NDVIV_0_10000" is the mean within a 10-km (10,000 m) circular buffer, etc.).
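The naming rule above can be expressed in one line; the helper name is hypothetical:

```r
# Sketch of the naming convention: short name plus a zero-padded,
# five-digit buffer radius in meters. Helper name is hypothetical.
make_covar_name <- function(short_name, radius_m) {
  sprintf("%s_%05d", short_name, radius_m)
}

make_covar_name("MOD_NDVIV_0", 0)      # "MOD_NDVIV_0_00000"
make_covar_name("MOD_NDVIV_0", 10000)  # "MOD_NDVIV_0_10000"
```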

VNP46A2 HDF files are not georeferenced; however, we can derive the exact extent in the MODIS sinusoidal CRS from other data. I will work on this next week.
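One way to attach georeferencing, sketched under the assumption that the tile layout follows the standard MODIS sinusoidal grid (36 x 18 tiles) as noted above; the file name is hypothetical:

```r
# Hedged sketch: compute a tile extent from the hXXvYY indices and assign
# it, plus the sinusoidal CRS, to an unreferenced raster. Assumes the
# standard MODIS sinusoidal grid; the file name is hypothetical.
library(terra)

tile_extent <- function(h, v, size = 1111950.5196666666) {
  xmin <- -20015109.354 + h * size   # global grid origin (west edge)
  ymax <-  10007554.677 - v * size   # v counts down from the north edge
  terra::ext(xmin, xmin + size, ymax - size, ymax)
}

f  <- "VNP46A2.A2022001.h08v05.001.h5"
hv <- regmatches(f, regexpr("h[0-9]{2}v[0-9]{2}", f))
h  <- as.integer(substr(hv, 2, 3))
v  <- as.integer(substr(hv, 5, 6))

r <- terra::rast(f)  # read without georeferencing
terra::ext(r) <- tile_extent(h, v)
terra::crs(r) <- "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs"
```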

MOD09GA calculation will be run over the weekend.

sigmafelix commented 9 months ago

12/08/2023 update

sigmafelix commented 9 months ago

12/11/2023 update

sigmafelix commented 8 months ago

MOD06_L2 data are on a curvilinear grid, which means the data should be rectified with terra::rectify() before any further analyses. However, running rectify() crashes the R session with a memory segfault. The terra GitHub issue tracker does not mention this problem. When triton/matern (Linux) and macOS were tested, the former ended with the same segfault, while the latter did not read the HDF file properly. stars::read_stars() reads the data as-is. I will try reading MOD06_L2 data with terra using an Apptainer image on the HPC, then try stars-based solutions.

sigmafelix commented 8 months ago

The Apptainer trial reproduced the memory segfault. A stars mosaicking trial was also unsuccessful, with these errors:

> files_mod06 <-
+     list.files(path = "./input/modis/raw/MOD06_L2",
+                pattern = "*.hdf$",
+                full.names = TRUE,
+                recursive = TRUE)
> v18001 <- grep("A2018001", files_mod06, value = T)
> header <- "HDF4_EOS:EOS_SWATH:"
> suffix <- ":mod06:"
> parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
> filename <- v18001[1:10]
> parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])
> nt1 <- stars::read_stars(parsinga[1])
Warning message:
ignoring unrecognized unit: none 
> nt2 <- stars::read_stars(parsinga[2])
Warning message:
ignoring unrecognized unit: none 
> lightmosaicked <-
+ stars::st_mosaic(nt1, nt2)
trying to read file: /tmp/Rtmpjlk49N/file2f1b361e04edf9.tif
Error: file not found
In addition: Warning messages:
1: In CPL_get_metadata(file, NA_character_, options) :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648
2: In CPL_read_gdal(as.character(x), as.character(options), as.character(driver),  :
  GDAL Error 1: Invalid dataset dimensions : -2147483648 x -2147483648

I will try Python code instead, though that may create problems for testing and package deployment.

kyle-messier commented 8 months ago

@sigmafelix @mitchellmanware Catching up a bit here. With the curvilinear grid, are we not able to just extract the value from MODIS to the point location? If we can do that, we can simplify this covariate and only use the exact grid cell extraction and not more complicated buffers where CRS and grid structure are more impactful.
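If buffers are dropped, point extraction might be possible without rectification by converting the curvilinear cells to points and joining each site to its nearest cell. A hedged sketch; the subdataset string and site coordinates are hypothetical, and whether this is robust for MOD06_L2 swaths would need verification:

```r
# Hedged sketch: nearest-cell extraction from a curvilinear swath without
# rectification. Subdataset string and site location are hypothetical.
library(stars)
library(sf)

sds <- "HDF4_EOS:EOS_SWATH:MOD06_L2.A2018001.1640.061.hdf:mod06:Cloud_Fraction_Day"
cf  <- stars::read_stars(sds)

# curvilinear cells as points, then nearest-feature join to the sites
cf_pts <- sf::st_as_sf(cf, as_points = TRUE)
sites  <- sf::st_as_sf(data.frame(lon = -78.9, lat = 35.9),
                       coords = c("lon", "lat"),
                       crs = sf::st_crs(cf_pts))
vals <- sf::st_join(sites, cf_pts, join = sf::st_nearest_feature)
```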

kyle-messier commented 8 months ago

@sigmafelix @mitchellmanware @dzilber @dawranadeep @eva0marques

I was just made aware of the NOAA database of all their datasets:

https://www.noaa.gov/nodd/datasets

It does have a light-at-night product, perhaps easier to use than the MODIS one, although I didn't check its temporal resolution. High temporal resolution may not be necessary for that covariate anyway.

Additionally, to everyone, we may find additional datasets to include as covariates, but let's be judicious about including them. Otherwise, we may never stop calculating covariates!

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology Thank you for sharing the list. It is really helpful. Night light products are available for 2018-2022 (and onward) from VIIRS (Colorado School of Mines), which mosaics all scenes around the globe monthly. If we are okay with monthly temporal resolution, I will work with these data (and write a download function as well). I think the median or average layer would work for covariate calculation.

sigmafelix commented 8 months ago

The Colorado School of Mines VIIRS mosaic looks fine in its annual product, but the monthly products look very noisy.

The dev version of terra has fixed the rectify() segfault. It worked on my own Linux laptop, but we have to wait for OSC to resolve a missing system dependency on the HPC before the dev version of terra can be built there. Meanwhile, I will use the Apptainer image instead.

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology

12/29/2023

sigmafelix commented 8 months ago

Raw MODIS data, except for MOD06_L2, were copied into the team repository in DDN.

sigmafelix commented 8 months ago

Using the Python script linked in a comment above, I could get the expected results from the swath data: image

It is clear that terra's handling of curvilinear grids needs improvement, which is beyond our scope. Perhaps we should include this Python script in ./tools and call it from the pipeline via the reticulate package.
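Wiring a repository Python script into the R pipeline could look like the following sketch; the script path and the function it exposes are hypothetical:

```r
# Sketch: call a Python swath-processing script from R via reticulate.
# Script path and function name are hypothetical.
library(reticulate)

reticulate::source_python("tools/mod06_swath.py")
# source_python() exposes the script's functions in the R session,
# e.g., a hypothetical `regrid_swath()`:
cf_day <- regrid_swath("MOD06_L2.A2018001.1640.061.hdf",
                       variable = "Cloud_Fraction_Day")
```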

sigmafelix commented 8 months ago

I think I figured it out with a combination of stars and terra. I will continue this way to calculate cloud coverage.

Code tried

# User library and project paths on DDN
ddnhome <- "/ddn/gs1/home/songi2"
usrlib <- file.path(ddnhome, "r-libs")
prjhome <- file.path(ddnhome, "projects", "NRTAPModel")
inputdir <- file.path(prjhome, "input")
.libPaths(usrlib)

# Install the working branch if the package is not available
if (!require(NRTAPmodel)) {
  pak::pak("Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel@isong-calc-covars")
  library(NRTAPmodel)
}

pkgs <- c("terra", "sf", "foreach", "parallelly", "doParallel",
          "data.table", "doRNG", "exactextractr")
invisible(sapply(pkgs, library, character.only = TRUE, quietly = TRUE))
sf::sf_use_s2(FALSE)
setwd(prjhome)

# Covariate functions and unique site locations
source(file.path(prjhome, "R", "calculate_covariates.R"))
source(file.path(inputdir, "Rinput", "processing_functions", "filter_unique_sites.R"))
sites <- filter_unique_sites()
sites_sf <-
  sf::st_as_sf(as.data.frame(sites), coords = c("lon", "lat"),
               crs = "EPSG:4326")

# MOD06_L2 swath files and GDAL subdataset strings
mod06_dir <- file.path(inputdir, "modis", "raw", "61", "MOD06_L2")
files_mod06 <-
  list.files(path = mod06_dir,
             pattern = "\\.hdf$",
             full.names = TRUE,
             recursive = TRUE)

header <- "HDF4_EOS:EOS_SWATH:"
suffix <- ":mod06:"
parsing <- c("Cloud_Fraction_Day", "Cloud_Fraction_Night")
filename <- files_mod06[1:24]
parsinga <- sprintf("%s%s%s%s", header, filename, suffix, parsing[2])

# Read one swath with stars and warp the curvilinear grid to WGS84
rectify_ref_stars <- function(ras) {
  ras <- stars::read_stars(ras)
  rtd <- stars::st_warp(ras, crs = 4326, cellsize = 0.025)
  return(rtd)
}

# Warp each swath, convert to SpatRaster, and mosaic into one layer
pps <- split(parsinga, parsinga) |>
  lapply(rectify_ref_stars) |>
  lapply(terra::rast) |>
  Reduce(f = terra::mosaic, x = _)
plot(pps)

Result

image

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology

MOD06_L2 processing is complete. The data were copied to the project directory in DDN (RDS file format). Due to the characteristics of curvilinear grids (as displayed above), there are missing values in the extracted data:

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate    mean      sd p0      p25     p50     p75    p100 hist 
1 ID                        0         1     530.    305.     1 265      530.    794     1058    ▇▇▇▇▇
2 MOD_CLCVD_0_00000     17899         0.991   0.580   0.439  0   0.0200   0.780   1.00     1.00 ▅▁▁▁▇
3 MOD_CLCVN_0_00000     71103         0.963   0.627   0.429  0   0.0800   0.920   1.00     1.00 ▅▁▁▁▇
4 MOD_CLCVD_0_01000     17854         0.991   0.580   0.437  0   0.0334   0.771   1        1    ▆▁▁▁▇
5 MOD_CLCVN_0_01000     71067         0.963   0.627   0.427  0   0.103    0.920   1        1    ▅▁▁▁▇
6 MOD_CLCVD_0_10000     17526         0.991   0.578   0.423  0   0.0620   0.730   1.00     1    ▆▁▁▁▇
7 MOD_CLCVN_0_10000     70809         0.963   0.621   0.416  0   0.131    0.859   1.00     1    ▅▁▁▁▇
8 MOD_CLCVD_0_50000     16505         0.991   0.573   0.395  0   0.142    0.662   0.989    1    ▅▂▂▂▇
9 MOD_CLCVN_0_50000     69961         0.964   0.608   0.390  0   0.187    0.750   0.996    1    ▅▂▂▂▇

Missing rates are less than 4 percent. How to treat these missing values is open for discussion.