developmentseed / titiler-cmr

Dynamic tiles from CMR queries

Evaluate MODIS HDF4 data for use with titiler-cmr #25


abarciauskas-bgse commented 1 month ago

Earth.gov, which is an instance of VEDA, has requested that we include MODIS data. It was correctly identified that, because this dataset is in Earthdata Cloud, it could be a candidate for titiler-cmr. However, I am coming to the conclusion that it won't work: the files are in HDF4, and my understanding is that no form of virtual file system access (s3fs, /vsicurl/, /vsis3/) will work because the HDF4 library does not implement any IO abstraction, so files must be read from a local file system by the underlying C library.

The only approach that would work is to download entire files and then read and tile from local storage, which definitely seems like a bad idea. I did verify that this works for at least one file, reading with xarray (rasterio backend), GDAL, or rasterio directly; see the sketch below. I just want to confirm that the conclusion above is correct so we can advise the VEDA/earth.gov leads that we may want to take this opportunity to create a cloud-optimized version of this dataset, though that will of course take more time.
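To illustrate what I mean by the download-then-read approach, here is a minimal sketch. It assumes GDAL built with HDF4 support and authorized AWS credentials; the granule key and subdataset name (MCD12Q1:LC_Type1) are placeholders, not actual paths.

import tempfile

import fsspec
import rasterio

granule = "s3://bucket/path/to/MCD12Q1.A2022001.h10v05.061.hdf"  # hypothetical granule key
fs = fsspec.filesystem("s3", profile="maap-data-reader")  # assumed credentials

with tempfile.NamedTemporaryFile(suffix=".hdf") as tmp:
    # HDF4 has to live on local disk so the C library can read it
    fs.get(granule, tmp.name)
    # GDAL subdataset string for one science dataset in the granule (illustrative)
    subdataset = f'HDF4_EOS:EOS_GRID:"{tmp.name}":MCD12Q1:LC_Type1'
    with rasterio.open(subdataset) as src:
        data = src.read(1)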

A few other notes:

cc @vincentsarago @wildintellect @sharkinsspatial

wildintellect commented 1 month ago

@abarciauskas-bgse which MODIS products? There are many, and many are already on Worldview, so why wouldn't we just tap into the existing web services?

I'm also not so sure about the claim that it won't work with s3fs: do you mean it won't work at all, or that it would simply perform poorly?

Here is a related product (VJ114IMG, a VIIRS product) that @chuckwondo and I tested yesterday; note that we used the h5netcdf engine to read the data, not the netCDF4 library.

import xarray as xr
import fsspec

sample_file = 's3://lp-prod-protected/VJ114IMG.002/VJ114IMG.A2024198.0942.002.2024198150736/VJ114IMG.A2024198.0942.002.2024198150736.nc'
s3_fsspec = fsspec.filesystem("s3", profile="maap-data-reader")
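# open the VIIRS netCDF (HDF5-based) granule directly from S3 as a file-like object;
# phony_dims tells h5netcdf how to name HDF5 dimensions that have no labels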
test = xr.open_dataset(s3_fsspec.open(sample_file), engine="h5netcdf", phony_dims='sort')

https://github.com/orgs/MAAP-Project/discussions/1031#discussioncomment-10067291

chuckwondo commented 1 month ago

FYI, when phony_dims is required, I recommend using "access" rather than "sort". With "access", phony dimensions are applied only when a particular array is accessed, whereas "sort" applies phony dims across the entire hierarchy (if I understand correctly), which means reading all of the metadata throughout the hierarchy whether you need it or not.
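For example, a minimal variant of the snippet above (same sample file and AWS profile), changing only the phony_dims argument:

import xarray as xr
import fsspec

sample_file = 's3://lp-prod-protected/VJ114IMG.002/VJ114IMG.A2024198.0942.002.2024198150736/VJ114IMG.A2024198.0942.002.2024198150736.nc'
s3_fsspec = fsspec.filesystem("s3", profile="maap-data-reader")
# phony dimensions are generated lazily, only for the arrays you actually access
test = xr.open_dataset(s3_fsspec.open(sample_file), engine="h5netcdf", phony_dims="access")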

abarciauskas-bgse commented 1 month ago

@wildintellect thanks for looking at this issue.

The MODIS product I am evaluating is MCD12Q1 v061: MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500 m SIN Grid. I just checked, and using the h5netcdf engine with phony dims also does not work for this dataset: https://gist.github.com/abarciauskas-bgse/8d967af117793bead9395020d8c22c48

You can see the error is

ValueError: b'\x0e\x03\x13\x01\x00\x10\x00\x00' is not the signature of a valid netCDF4 file

which is raised from this line: https://github.com/pydata/xarray/blob/main/xarray/backends/h5netcdf_.py#L161. That indicates to me that this product is not a valid netCDF-4 file, which I think is expected: the file format for this collection is HDF4 (Hierarchical Data Format Release 4), whereas for the VIIRS product it is netCDF (Network Common Data Format). See the signature check sketched below.
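For reference, those leading bytes are the HDF4 magic number (0x0e031301), while HDF5/netCDF-4 files begin with \x89HDF\r\n\x1a\n. A quick check along these lines (the S3 key below is a placeholder and the credentials are assumed) can tell the two apart without downloading the whole granule:

import fsspec

HDF4_MAGIC = b"\x0e\x03\x13\x01"   # HDF4 files start with these four bytes
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # HDF5 (and therefore netCDF-4) signature

def detect_format(url, fs):
    # read only the first 8 bytes of the object and compare against known signatures
    with fs.open(url, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF4_MAGIC):
        return "HDF4"
    if head.startswith(HDF5_MAGIC):
        return "HDF5/netCDF-4"
    return f"unknown: {head!r}"

fs = fsspec.filesystem("s3", profile="maap-data-reader")
print(detect_format("s3://bucket/path/to/MCD12Q1.A2022001.h10v05.061.hdf", fs))  # hypothetical key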

There is a very similar MODIS product on Worldview: https://go.nasa.gov/3zPxN9O. Unfortunately, that product only goes through 2019 and appears to have been decommissioned (MCD12Q1 v006 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500 m SIN Grid), pointing users to the newer version, which is the same product we are investigating here.

Aside: from a bit more reading of the user guide, it appears this is an annual product where each granule covers a different spatial extent (a tile of the sinusoidal grid).

wildintellect commented 1 month ago

@abarciauskas-bgse interesting. If it's an annual product, then there probably aren't that many granules, which means the best option might be to convert the data format; see the sketch below.
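For example, a minimal sketch of converting one science dataset from a locally downloaded granule to a COG, assuming GDAL built with HDF4 support and the rio-cogeo package; the local path and subdataset name (MCD12Q1:LC_Type1) are placeholders:

from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# GDAL subdataset string for one science dataset in a locally downloaded granule (illustrative)
src = 'HDF4_EOS:EOS_GRID:"/tmp/MCD12Q1.A2022001.h10v05.061.hdf":MCD12Q1:LC_Type1'

# write a cloud-optimized GeoTIFF using the standard deflate profile
cog_translate(src, "MCD12Q1.A2022001.h10v05.061.LC_Type1.tif", cog_profiles.get("deflate"))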

abarciauskas-bgse commented 1 month ago

Well, it's 315 granules a year (I think) over roughly 20 years, so about 6,300 (the actual number is 6,930, i.e. 22 years at 315 granules per year). But what I was implying in that comment is that I do think creating annual COGs or a Zarr dataset would be interesting.

abarciauskas-bgse commented 1 month ago

To wrap up this ticket, I will write up a list of options to discuss with the VEDA and earth.gov leads.

maxrjones commented 1 month ago

A few other conversations of various ages support your conclusion, Aimee: