developmentseed / titiler-cmr

Dynamic tiles from CMR queries

Evaluate MODIS HDF4 data for use with titiler-cmr #25


abarciauskas-bgse commented 1 month ago

Earth.gov, which is an instance of VEDA, has requested that we include MODIS data. It was correctly identified that, because this dataset is in Earthdata Cloud, it could be a candidate for titiler-cmr. However, I am coming to the conclusion that it won't work: the files are in HDF4, and my understanding is that no form of virtual file system access (s3fs, /vsicurl/, /vsis3/) will work because the HDF4 library does not implement any IO abstraction, so files must be read from a local file system by the underlying C library.

The only approach that would work is to download entire files and then read and tile from local storage, which definitely seems like a bad idea. I did verify that this works for at least one file, reading with xarray (rasterio backend), GDAL, or rasterio directly; see the sketch below. I just want to confirm that the conclusion above is correct so we can advise the VEDA/earth.gov leads that we may want to take this opportunity to create a cloud-optimized version of this dataset, though that will of course take more time.
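To illustrate what I mean by the download-then-read approach, here is a minimal sketch. It assumes GDAL built with HDF4 support and authorized AWS credentials; the granule key and subdataset name (MCD12Q1:LC_Type1) are placeholders, not actual paths.

import tempfile

import fsspec
import rasterio

granule = "s3://bucket/path/to/MCD12Q1.A2022001.h10v05.061.hdf"  # hypothetical granule key
fs = fsspec.filesystem("s3", profile="maap-data-reader")  # assumed credentials

with tempfile.NamedTemporaryFile(suffix=".hdf") as tmp:
    # HDF4 has to live on local disk so the C library can read it
    fs.get(granule, tmp.name)
    # GDAL subdataset string for one science dataset in the granule (illustrative)
    subdataset = f'HDF4_EOS:EOS_GRID:"{tmp.name}":MCD12Q1:LC_Type1'
    with rasterio.open(subdataset) as src:
        data = src.read(1)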

A few other notes:

cc @vincentsarago @wildintellect @sharkinsspatial

wildintellect commented 1 month ago

@abarciauskas-bgse which MODIS products? There are many, and many are already on Worldview, so why wouldn't we just tap into the existing web services?

I'm also not so sure about the claim that it won't work with s3fs: do you mean it won't work at all, or that it would simply perform poorly?

Here is a related product (VJ114IMG, a VIIRS product) that @chuckwondo and I tested yesterday; note that we used the h5netcdf engine to read the data, not the netCDF4 library.

import xarray as xr
import fsspec

sample_file = 's3://lp-prod-protected/VJ114IMG.002/VJ114IMG.A2024198.0942.002.2024198150736/VJ114IMG.A2024198.0942.002.2024198150736.nc'
s3_fsspec = fsspec.filesystem("s3", profile="maap-data-reader")
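# open the VIIRS netCDF (HDF5-based) granule directly from S3 as a file-like object;
# phony_dims tells h5netcdf how to name HDF5 dimensions that have no labels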
test = xr.open_dataset(s3_fsspec.open(sample_file), engine="h5netcdf", phony_dims='sort')

https://github.com/orgs/MAAP-Project/discussions/1031#discussioncomment-10067291

chuckwondo commented 1 month ago

FYI, when phony_dims is required, I recommend using "access" rather than "sort". With "access", phony dimensions are applied only when a particular array is accessed, whereas "sort" applies phony dims across the entire hierarchy (if I understand correctly), which means reading all of the metadata throughout the hierarchy whether you need it or not.
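For example, a minimal variant of the snippet above (same sample file and AWS profile), changing only the phony_dims argument:

import xarray as xr
import fsspec

sample_file = 's3://lp-prod-protected/VJ114IMG.002/VJ114IMG.A2024198.0942.002.2024198150736/VJ114IMG.A2024198.0942.002.2024198150736.nc'
s3_fsspec = fsspec.filesystem("s3", profile="maap-data-reader")
# phony dimensions are generated lazily, only for the arrays you actually access
test = xr.open_dataset(s3_fsspec.open(sample_file), engine="h5netcdf", phony_dims="access")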

abarciauskas-bgse commented 1 month ago

@wildintellect thanks for looking at this issue.

The MODIS product I am evaluating is MCD12Q1 v061: MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500 m SIN Grid. I just checked, and using the h5netcdf engine with phony dims also does not work for this dataset: https://gist.github.com/abarciauskas-bgse/8d967af117793bead9395020d8c22c48

You can see the error is

ValueError: b'\x0e\x03\x13\x01\x00\x10\x00\x00' is not the signature of a valid netCDF4 file

which is raised from this line: https://github.com/pydata/xarray/blob/main/xarray/backends/h5netcdf_.py#L161. That indicates to me that this product is not a valid netCDF-4 file, which I think is expected: the file format for this collection is HDF4 (Hierarchical Data Format Release 4), whereas for the VIIRS product it is netCDF (Network Common Data Format). See the signature check sketched below.
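For reference, those leading bytes are the HDF4 magic number (0x0e031301), while HDF5/netCDF-4 files begin with \x89HDF\r\n\x1a\n. A quick check along these lines (the S3 key below is a placeholder and the credentials are assumed) can tell the two apart without downloading the whole granule:

import fsspec

HDF4_MAGIC = b"\x0e\x03\x13\x01"   # HDF4 files start with these four bytes
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # HDF5 (and therefore netCDF-4) signature

def detect_format(url, fs):
    # read only the first 8 bytes of the object and compare against known signatures
    with fs.open(url, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF4_MAGIC):
        return "HDF4"
    if head.startswith(HDF5_MAGIC):
        return "HDF5/netCDF-4"
    return f"unknown: {head!r}"

fs = fsspec.filesystem("s3", profile="maap-data-reader")
print(detect_format("s3://bucket/path/to/MCD12Q1.A2022001.h10v05.061.hdf", fs))  # hypothetical key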

There is a very similar MODIS product on Worldview: https://go.nasa.gov/3zPxN9O. Unfortunately, that product only goes through 2019 and appears to have been decommissioned (MCD12Q1 v006 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500 m SIN Grid), pointing users to the newer version, which is the same product we are investigating here.

Aside: from a bit more reading of the user guide, it appears this is an annual product where each granule covers a different spatial extent (a tile of the sinusoidal grid).

wildintellect commented 1 month ago

@abarciauskas-bgse interesting. If it's an annual product, then there probably aren't that many granules, which means the best option might be to convert the data format; see the sketch below.
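For example, a minimal sketch of converting one science dataset from a locally downloaded granule to a COG, assuming GDAL built with HDF4 support and the rio-cogeo package; the local path and subdataset name (MCD12Q1:LC_Type1) are placeholders:

from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# GDAL subdataset string for one science dataset in a locally downloaded granule (illustrative)
src = 'HDF4_EOS:EOS_GRID:"/tmp/MCD12Q1.A2022001.h10v05.061.hdf":MCD12Q1:LC_Type1'

# write a cloud-optimized GeoTIFF using the standard deflate profile
cog_translate(src, "MCD12Q1.A2022001.h10v05.061.LC_Type1.tif", cog_profiles.get("deflate"))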

abarciauskas-bgse commented 1 month ago

Well, it's 315 granules a year (I think) over roughly 20 years, so about 6,300 (the actual number is 6,930, i.e. 22 years at 315 granules per year). But what I was implying in that comment is that I do think creating annual COGs or a Zarr dataset would be interesting.

abarciauskas-bgse commented 1 month ago

To wrap up this ticket, I will write up a list of options to discuss with the VEDA and earth.gov leads.

maxrjones commented 1 month ago

A few other conversations of various ages support your conclusion, Aimee: