NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

NLCD extraction error related to Erdas Imagine file formats on ddn #204

Closed: sigmafelix closed this issue 9 months ago

sigmafelix commented 9 months ago

NLCD data from the official website are distributed as zip files, each containing an Erdas Imagine (.img) file and its auxiliary files (ige, rrd, etc.). When I run exactextractr::exact_extract on these files stored on ddn, using the code below on the HPC (i.e., triton), the result is always a one-column data frame:

library(terra)
library(sf)
library(exactextractr)
options(sf_use_s2 = FALSE)

ext0 <- c(xmin = -2000000, xmax = -1500000, ymin = 1500000, ymax = 2000000)
ext0 <- terra::ext(ext0)
nlcd_small <- terra::rast("/ddn/gs1/group/set/Projects/NRT-AP-Model/input/nlcd/raw/nlcd_2021_land_cover_l48_20230630.img",
    win = ext0)

ssampbsf <- sf::st_read("__pregenerated_buffer_.gpkg")

ssampext <- exactextractr::exact_extract(
    nlcd_small,
    ssampbsf,
    fun = "frac",
    force_df = TRUE
)

head(ssampext)

#  |======================================================================| 100%
#  frac_0
# 1      1
# 2      1
# 3      1
# 4      1
# 5      1
# 6      1

However, when I ran the same code (with the path modified) on my local system, the results were as expected:

#   |======================================================================| 100%
#        frac_11 frac_12      frac_21      frac_22      frac_23      frac_24
# 1 5.753208e-06       0 0.0001813312 8.629811e-06 0.000000e+00 0.000000e+00
# 2 2.076172e-05       0 0.0015462684 2.761540e-03 1.136747e-03 1.467068e-04
# 3 0.000000e+00       0 0.0036744867 8.529470e-04 1.150642e-04 3.739585e-05
# 4 0.000000e+00       0 0.0002042396 3.336051e-03 3.360826e-03 8.054491e-05
# 5 0.000000e+00       0 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
# 6 0.000000e+00       0 0.0007550599 6.991125e-04 2.935899e-05 0.000000e+00
#        frac_31     frac_41      frac_42    frac_43   frac_52      frac_71
# 1 5.753208e-05 0.000000000 4.766394e-01 0.00000000 0.5218816 0.0012113617
# 2 1.857006e-01 0.000000000 8.629811e-06 0.00000000 0.8069571 0.0002286142
# 3 1.980251e-02 0.000000000 5.526067e-02 0.00000000 0.9095345 0.0107224332
# 4 5.580612e-04 0.006336919 2.489696e-01 0.01026623 0.7175284 0.0019115502
# 5 2.477963e-04 0.000000000 1.110742e-01 0.00000000 0.8674962 0.0210179109
# 6 7.110124e-03 0.000000000 0.000000e+00 0.00000000 0.7270551 0.2643512785
#   frac_81 frac_82     frac_90      frac_95
# 1       0       0 0.000000000 1.438302e-05
# 2       0       0 0.001475698 1.725962e-05
# 3       0       0 0.000000000 0.000000e+00
# 4       0       0 0.004418463 3.029064e-03
# 5       0       0 0.000000000 1.639664e-04
# 6       0       0 0.000000000 0.000000e+00

This is possibly due to the file system and the technical specification of the Erdas Imagine file format, given the difference in file sizes between local and ddn:

Local: [image: file sizes]

DDN: [image: file sizes]

I also tried downloading the NLCD zip file directly from the webpage to my local machine and then uploading the unzipped files to ddn, but the results were the same.

All things considered, I converted the .img file(s) into a GeoTIFF file using gdal_translate nlcd_2019_...img nlcd_2019_...tif and uploaded the .tif file to ddn. With the .tif file, the code above worked as expected:

#   |======================================================================| 100%
#        frac_11 frac_12      frac_21      frac_22      frac_23      frac_24
# 1 5.753208e-06       0 0.0001813312 8.629811e-06 0.000000e+00 0.000000e+00
# 2 2.076172e-05       0 0.0015462684 2.761540e-03 1.136747e-03 1.467068e-04
# 3 0.000000e+00       0 0.0036744867 8.529470e-04 1.150642e-04 3.739585e-05
# 4 0.000000e+00       0 0.0002042396 3.336051e-03 3.360826e-03 8.054491e-05
# 5 0.000000e+00       0 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
# 6 0.000000e+00       0 0.0007550599 6.991125e-04 2.935899e-05 0.000000e+00
#        frac_31     frac_41      frac_42    frac_43   frac_52      frac_71
# 1 5.753208e-05 0.000000000 4.766394e-01 0.00000000 0.5218816 0.0012113617
# 2 1.857006e-01 0.000000000 8.629811e-06 0.00000000 0.8069571 0.0002286142
# 3 1.980251e-02 0.000000000 5.526067e-02 0.00000000 0.9095345 0.0107224332
# 4 5.580612e-04 0.006336919 2.489696e-01 0.01026623 0.7175284 0.0019115502
# 5 2.477963e-04 0.000000000 1.110742e-01 0.00000000 0.8674962 0.0210179109
# 6 7.110124e-03 0.000000000 0.000000e+00 0.00000000 0.7270551 0.2643512785
#   frac_81 frac_82     frac_90      frac_95
# 1       0       0 0.000000000 1.438302e-05
# 2       0       0 0.001475698 1.725962e-05
# 3       0       0 0.000000000 0.000000e+00
# 4       0       0 0.004418463 3.029064e-03
# 5       0       0 0.000000000 1.639664e-04
# 6       0       0 0.000000000 0.000000e+00

An NLCD preprocessing function, or an additional step in the NLCD download function, needs to be added to convert the .img file to a .tif file. A potential problem is that DDN does not have the gdal-bin package, so installing it requires approval by OSC. We might use an Apptainer container for this task instead.

This issue is relevant to @eva0marques and @mitchellmanware. I suggest that @Spatiotemporal-Exposures-and-Toxicology add this as an agenda item for the next meeting.

mitchellmanware commented 9 months ago

@sigmafelix

This is good to know. Yesterday @eva0marques sent me an R package called FedData, which includes a function for downloading NLCD data. I will investigate the function's data source and methodology while waiting for the discussion with OSC about the gdal-bin package.

sigmafelix commented 9 months ago

I worked with Frank at OSC to get the GDAL utilities installed on the highmem partitions and triton. The gdal_translate command will work on both, so we can write a few lines of script to convert the Erdas Imagine file into GeoTIFF. Availability on the geo cluster is pending. One potential problem to think about in the near future is configuring the GitHub Actions runner with the GDAL utilities so that all (future) tests for the pipeline pass.
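A minimal sketch of such a conversion script is below, assuming gdal_translate is on the PATH. The function names (tif_path_for, convert_img_dir) and the idea of converting a whole directory are hypothetical and only for illustration. The creation options (COMPRESS, TILED, BIGTIFF) are standard GeoTIFF driver options but are a suggestion, not something settled in this thread.

```shell
#!/usr/bin/env bash
# Sketch: convert every Erdas Imagine (.img) raster in a directory to GeoTIFF.
# Function names and the directory layout are hypothetical.

# Derive the output path: same directory and basename, with a .tif extension.
tif_path_for() {
    local img="$1"
    printf '%s.tif\n' "${img%.img}"
}

# Convert all .img files in a directory, skipping ones already converted.
convert_img_dir() {
    local dir="$1" img tif
    if ! command -v gdal_translate >/dev/null 2>&1; then
        echo "gdal_translate not found on PATH" >&2
        return 1
    fi
    for img in "$dir"/*.img; do
        [ -e "$img" ] || continue      # the glob matched no .img files
        tif="$(tif_path_for "$img")"
        [ -e "$tif" ] && continue      # output already exists
        # -of GTiff selects the GeoTIFF driver. The -co creation options are
        # optional: BIGTIFF=IF_SAFER guards against the 4 GB classic TIFF limit.
        gdal_translate -of GTiff \
            -co COMPRESS=DEFLATE -co TILED=YES -co BIGTIFF=IF_SAFER \
            "$img" "$tif" || return 1
    done
}
```

After sourcing the file, a call such as convert_img_dir /path/to/nlcd/raw would convert the rasters in that directory. The path here is a placeholder.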

sigmafelix commented 9 months ago

Even after converting the .img file into .tif with gdal_translate on Triton, I still see the strange behavior of a single-column data.frame being returned. I could not figure out what exactly causes the problem. Perhaps we need to add a warning message telling users to convert the NLCD .img file into .tif with gdal_translate locally.

sigmafelix commented 9 months ago

Additional tests: on the geo cluster, I converted .img to .tif and got the same erroneous results. The next experiment is to convert the file in a scratch folder on the geo cluster; I heard from OSC that the scratch space there has been fixed and is now accessible to users. Perhaps this issue is trivial, but I will experiment with several approaches to identify its exact cause. I have already ruled out terra version issues: I tried 1.7.46 and 1.7.55 separately on my local machine, and both worked without problems.

Since this issue is not urgent, I will run these experiments from time to time and share the results by next Monday (12/18/2023).

sigmafelix commented 9 months ago

I tried downloading directly to a ddn location on triton:

wget -O nlcd_2021.zip https://s3-us-west-2.amazonaws.com/mrlc/nlcd_2021_land_cover_l48_20230630.zip
mkdir nlcd2021_test
unzip nlcd_2021.zip -d nlcd2021_test

I ran the same script as above on the unzipped file and got:

Cannot preload entire working area of 300294205 cells with max_cells_in_memory = 3e+07. Raster values will be read for each feature individually.
  |======================================================================| 100%
       frac_11 frac_12      frac_21      frac_22      frac_23      frac_24
1 5.753208e-06       0 0.0001813312 8.629811e-06 0.000000e+00 0.000000e+00
2 3.514474e-05       0 0.0014484638 2.755787e-03 1.240304e-03 1.467068e-04
3 0.000000e+00       0 0.0036716100 8.414406e-04 1.294472e-04 3.739585e-05
4 0.000000e+00       0 0.0002042396 3.333175e-03 3.363703e-03 8.054491e-05
5 0.000000e+00       0 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00
6 0.000000e+00       0 0.0007550599 6.991125e-04 2.935899e-05 0.000000e+00
       frac_31     frac_41      frac_42    frac_43   frac_52      frac_71
1 5.753208e-05 0.000000000 4.899567e-01 0.00000000 0.5089095 0.0008661693
2 1.794627e-01 0.000000000 8.629811e-06 0.00000000 0.8131807 0.0002286142
3 2.038847e-02 0.000000000 5.535273e-02 0.00000000 0.9092591 0.0103197088
4 4.602566e-04 0.006336919 2.493024e-01 0.01026623 0.7172992 0.0019057969
5 2.880687e-04 0.000000000 1.111000e-01 0.00000000 0.8674674 0.0209805164
6 1.542041e-03 0.000000000 0.000000e+00 0.00000000 0.7326260 0.2643483877
  frac_81 frac_82     frac_90      frac_95
1       0       0 0.000000000 1.438302e-05
2       0       0 0.001475698 1.725962e-05
3       0       0 0.000000000 0.000000e+00
4       0       0 0.004418463 3.029064e-03
5       0       0 0.000000000 1.639664e-04
6       0       0 0.000000000 0.000000e+00

I think the issue was simply caused by a corrupted zip file.
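One way to catch a damaged archive before extraction is unzip's built-in test mode (-t), which checks each member against its stored CRC. The sketch below assumes Info-ZIP unzip is installed; the function names zip_is_intact and extract_if_intact are hypothetical. It only detects corruption inside the archive, so it would not explain problems introduced after extraction.

```shell
#!/usr/bin/env bash
# Sketch: test a zip archive before extracting it. Function names are hypothetical.

# Return 0 only if every member of the archive passes unzip's CRC test.
zip_is_intact() {
    unzip -tq "$1" >/dev/null 2>&1
}

# Extract into a destination directory only when the archive tests clean.
extract_if_intact() {
    local archive="$1" dest="$2"
    if zip_is_intact "$archive"; then
        mkdir -p "$dest" && unzip -q -o "$archive" -d "$dest"
    else
        echo "Archive failed integrity test: $archive" >&2
        return 1
    fi
}
```

For example, extract_if_intact nlcd_2021.zip nlcd2021_test would refuse to unpack a truncated download instead of silently producing incomplete files.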

sigmafelix commented 9 months ago

In the long run, we need a script that verifies that the downloaded file is identical to the original on the server (e.g., using a checksum such as SHA-256). FYI, the next NLCD release is expected in 2025.
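A rough sketch of such a check is below. It assumes a trusted SHA-256 digest is available from somewhere, for example one computed on a copy known to work; whether the NLCD server publishes checksums is not established in this thread. The function names sha256_of and verify_sha256 are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare a file's SHA-256 digest with an expected value.
# Function names are hypothetical; the expected digest must come from a trusted source.

# Print the SHA-256 hex digest of a file.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        shasum -a 256 "$1" | awk '{print $1}'   # fallback where sha256sum is absent
    fi
}

# Return 0 if the file's digest equals the expected digest, 1 otherwise.
# The comparison is case-sensitive; both tools print lowercase hex.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256_of "$file")"
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}
```

A download step could then call verify_sha256 nlcd_2021.zip "$EXPECTED_DIGEST" and stop when it returns non-zero, where EXPECTED_DIGEST is a placeholder for the trusted value.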

I will close this issue.