fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

NWM2.1 reanalysis question #117

Status: Open. Opened by dialuser 2 years ago

dialuser commented 2 years ago

Hi, I'm new to kerchunk. I followed the example (see below) to convert the NWM S3 files to Zarr. It worked well for files before 2007, but for files after 2007 I got an error saying the file signature was not found:

File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "h5py/h5f.pyx", line 96, in h5py.h5f.open OSError: Unable to open file (file signature not found)

It turns out that the post-2007 NWM2.1 reanalysis data are not stored as HDF5 files (they are plain netCDF files). I wonder if you know of a way of getting around that.

Thanks, Alex

#my code starts here =============================================
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr  # for combining the per-file references later

def testme():
    #u = 's3://noaa-nwm-retrospective-2-1-pds/forcing/1996/199602182000.LDASIN_DOMAIN1'  # this works
    u = 's3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1'  # but this does not work
    so = dict(
        mode="rb", anon=True, default_fill_cache=False, default_cache_type="none"
    )
    with fsspec.open(u, **so) as inf:
        print(u)
        # scan the remote HDF5 file for chunk references
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)
        # write the references out so they can be fed to MultiZarrToZarr later
        with open("single_file_refs.json", "w") as out:  # output path is arbitrary
            json.dump(h5chunks.translate(), out)

if __name__ == '__main__':
    testme()
martindurant commented 2 years ago
In [14]: fs = fsspec.filesystem("s3", anon=True)

In [17]: fs.head("s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1", 8)
Out[17]: b'CDF\x01\x00\x00\x00\x00'

It appears to be "classic netCDF CDF-1 format" (see here). That would need a separate conversion class; the file format looks simpler, but I don't know if the old CDF libraries will be as convenient. If the chunking remains the same, there would be, in principle, no problem combining the different file formats into a global kerchunked dataset.
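One way to act on that signature check programmatically is to peek at the first few bytes before deciding which scanner to use. A rough sketch, using only the fsspec call shown above; the dispatch logic and function name are illustrative, not part of kerchunk:

import fsspec

fs = fsspec.filesystem("s3", anon=True)

def netcdf_flavour(path):
    # b"\x89HDF" starts an HDF5/netCDF-4 file; b"CDF\x01" / b"CDF\x02" mark classic
    # netCDF (CDF-1 / 64-bit-offset CDF-2), which SingleHdf5ToZarr cannot read
    magic = fs.head(path, 4)
    if magic.startswith(b"\x89HDF"):
        return "netCDF-4/HDF5"
    if magic in (b"CDF\x01", b"CDF\x02"):
        return "classic netCDF"
    return "unknown"

print(netcdf_flavour("s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1"))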

@rsignell-usgs , any idea why it looks like the data format became older in 2007? Or was this some sort of HDF5 (not CDF) -> CDF (not HDF) evolution?

dialuser commented 2 years ago

Just to add some extra observations: the post-2007 NWM2.1 files are not only bigger (about 540 MB each), they also follow a slightly different naming convention (e.g., 10-digit 2007010100 vs. 12-digit 199602182000). In any case, I'd appreciate it if people in this forum could help me find a temporary solution; I've already spent several days converting the pre-2007 files.

martindurant commented 2 years ago

I don't anticipate having time to implement a netCDF<4 scanner in the near term, but perhaps someone else has? At a guess, the files are much larger because there is no compression; but maybe the chunking is still the same.
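For what it's worth, a sketch of what using such a scanner could look like, assuming a kerchunk build that ships the later-added kerchunk.netCDF3 module; the output filename is arbitrary:

import json

from kerchunk.netCDF3 import NetCDF3ToZarr

u = "s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1"
# scan the classic-netCDF header and emit byte-range references,
# analogous to what SingleHdf5ToZarr does for HDF5 files
refs = NetCDF3ToZarr(u, storage_options={"anon": True}, inline_threshold=300).translate()
with open("2007010100.json", "w") as out:
    json.dump(refs, out)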

rsignell-usgs commented 2 years ago

@dialuser, also note that the NWM2.1 data is already available in Zarr format from https://registry.opendata.aws/nwm-archive/. Specifically:

aws s3 ls s3://noaa-nwm-retrospective-2-1-zarr-pds/ --no-sign-request

The rechunking and conversion to Zarr was done by @jmccreight, who would likely be able to answer these questions if necessary.
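A minimal sketch of opening one of those published stores with xarray; the store name precip.zarr and the use of consolidated metadata are assumptions here, so confirm the actual bucket layout with the aws listing above:

import fsspec
import xarray as xr

# "precip.zarr" is assumed; list the bucket to see the real store names
m = fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/precip.zarr", anon=True)
ds = xr.open_zarr(m, consolidated=True)  # consolidated metadata assumed
print(ds)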

jmccreight commented 2 years ago

@rsignell-usgs Thanks for pinging me here. @dialuser, I changed jobs and had covid, so your email fell through the cracks. I was looking for your email recently but could not find it. I have had several inquiries on this exact topic, which also confused me.

Thanks for these questions. The answer is that no single conversion process or person produced all the LDASIN files here. I'm not fully up on what was done, but I ran into some similar (though different) issues myself when processing the data on NCAR systems.

https://github.com/NCAR/rechunk_retro_nwm_v21/blob/da170bf2af462a4a117ceebc39f751d3ba91ea74/precip/symlink_aorc.py#L18 You can see there are essentially three different periods of data with different conventions (at least it's finite, right?).

I can anecdotally confirm what @martindurant uncovered above:

jamesmcc@casper-login2[1017]:/glade/p/cisl/nwc/nwm_forcings/AORC> for ff in $f1 $f2 $f3; do echo $ff: `ncdump -k $ff`;  done
/glade/campaign/ral/hap/zhangyx/AORC.Forcing/2007/200702010000.LDASIN_DOMAIN1: netCDF-4
/glade/p/cisl/nwc/nwm_forcings/AORC/2007020101.LDASIN_DOMAIN1: classic
/glade/p/cisl/nwc/nwm_forcings/AORC/202002010100.LDASIN_DOMAIN1: netCDF-4

I believe that the file size difference is because no compression ("deflate level") is available for classic (as @martindurant pointed out), while _DeflateLevel = 2 is applied in the other, netCDF-4 files (that I looked at). I was surprised to see that there is chunking in the netCDF-4 files: for (time, y, x), _ChunkSizes = 1, 768, 922. It appears there is no chunking in the classic (as far as I can tell).
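As an aside, one rough way to confirm those chunk and deflate settings without downloading a whole file is to point h5py at the remote object through fsspec. A sketch, assuming anonymous S3 access and a netCDF-4 file; the 1996 path is taken from the snippet earlier in this thread:

import fsspec
import h5py

u = "s3://noaa-nwm-retrospective-2-1-pds/forcing/1996/199602182000.LDASIN_DOMAIN1"
with fsspec.open(u, "rb", anon=True) as f:
    ds = h5py.File(f, "r")
    for name, var in ds.items():
        if isinstance(var, h5py.Dataset):
            # chunks would be e.g. (1, 768, 922); compression/compression_opts show the deflate level
            print(name, var.shape, var.chunks, var.compression, var.compression_opts)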

I don't expect much can be done on the NCAR/NOAA end at this point except to take note that this is a problem. I will connect you with at least one other user who is interested in this data; perhaps you can collaborate on a solution (I may point them here). It would be nice to see. I honestly did not know that all this forcing data was part of the release; I thought that only the Zarr precip field that I processed was what was released.

martindurant commented 2 years ago

Note on this point:

It appears there is no chunking in the classic (as far as I can tell)

If the blocks are not compressed, then, from a kerchunk point of view, we can pick any chunking we like along the biggest dimension (and the second-biggest, if we choose a chunk size of 1 for the biggest), so it may still be possible to get consistency across the different file species.
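To make that concrete, here is an illustrative sketch of how byte-range references could be generated for one uncompressed, contiguously stored variable with an arbitrary chunking along y. The variable name, grid shape, and header offset below are hypothetical placeholders; a real scanner would read them from the CDF header:

import numpy as np

url = "s3://noaa-nwm-retrospective-2-1-pds/forcing/2007/2007010100.LDASIN_DOMAIN1"
varname = "RAINRATE"        # assumed variable name, for illustration only
shape = (1, 3840, 4608)     # assumed (time, y, x) grid size
dtype = np.dtype(">f4")     # classic netCDF stores big-endian 4-byte floats
data_offset = 1024          # assumed byte offset of this variable's data block
chunk_y = 768               # any chunking we like along y

refs = {}
row_bytes = shape[2] * dtype.itemsize
for i, y0 in enumerate(range(0, shape[1], chunk_y)):
    offset = data_offset + y0 * row_bytes
    length = min(chunk_y, shape[1] - y0) * row_bytes
    # kerchunk-style reference: key -> [url, byte offset, byte length]
    refs[f"{varname}/0.{i}.0"] = [url, offset, length]

print(len(refs), "chunk references generated")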