fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/

Issue kerchunking NWM medium range netcdf via HTTPS #104

rsignell-usgs commented 2 years ago

I'm trying to kerchunk the medium range netcdf files from the National Water Model, which are only accessible via HTTPS from NOMADS (not S3).

I can successfully process a few files, but then it bombs out with a cryptic error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_1913/3841065069.py in <module>
      1 for url in urls:
----> 2     gen_json(url)

/tmp/ipykernel_1913/476567446.py in gen_json(u)
      6         h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
      7         with fs2.open(outf, 'wb') as f:
----> 8             f.write(ujson.dumps(h5chunks.translate()).encode());

/home/conda/store/61285a2505cb2471f92bbdbcc3eccd36f27fdbd1be0fd3e864dac20b9c422482-pangeo/lib/python3.9/site-packages/kerchunk/hdf.py in translate(self)
     71         lggr.debug('Translation begins')
     72         self._transfer_attrs(self._h5f, self._zroot)
---> 73         self._h5f.visititems(self._translator)
     74         if self.inline > 0:
     75             self._do_inline(self.inline)

/home/conda/store/61285a2505cb2471f92bbdbcc3eccd36f27fdbd1be0fd3e864dac20b9c422482-pangeo/lib/python3.9/site-packages/h5py/_hl/group.py in visititems(self, func)
    610                 name = self._d(name)
    611                 return func(name, self[name])
--> 612             return h5o.visit(self.id, proxy)
    613 
    614     @with_phil

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5o.pyx in h5py.h5o.visit()

RuntimeError: Object visitation failed (wrong fractal heap header signature)

Do you think this is on the provider side or on our end?

Here's a notebook that should reproduce the issue: https://nbviewer.org/gist/rsignell-usgs/2af84b60301a9c7c9a46aaef2b84d2fa
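
(For reference, a minimal sketch of what the gen_json helper in the traceback above presumably looks like; the SingleHdf5ToZarr call is taken from the traceback, while the local fs2 filesystem and the output naming are assumptions.)

import fsspec
import ujson
from kerchunk.hdf import SingleHdf5ToZarr

fs2 = fsspec.filesystem("file")  # local filesystem for writing the reference JSON

def gen_json(u):
    # Scan the remote HDF5 file over HTTPS and write its kerchunk references.
    outf = u.split("/")[-1] + ".json"  # hypothetical output name
    with fsspec.open(u, "rb") as infile:
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        with fs2.open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())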

martindurant commented 2 years ago

Oh no, it's fractal! The mention of "heap" suggests to me there may be some variable-length structures like strings.

Can you process the failing file on its own, without first running the other files? Does it always fail at the same point?

rsignell-usgs commented 2 years ago

@martindurant, very good idea! It was dying on the 14th file, so I tried another range and it died on the 14th file again -- so it has nothing to do with the files! That sounded like throttling, so I googled and found this: https://luckgrib.com/blog/2021/04/19/throttling.html which says that NOMADS limits requests to 120 per minute. I guess this is not good news for using kerchunk, since there are 480 files in the collection. :(

martindurant commented 2 years ago

> I guess this is not good news for using kerchunk

You could phrase this either way. I would say that we do have an excellent way to get all the metadata (slowly) and present it at a location not subject to throttling; thereafter the user can grab subsections of the data very efficiently. Of course, you can't do anything massively parallel here.

rsignell-usgs commented 2 years ago

@martindurant right -- it's not that kerchunk doesn't work well here, it's that by throttling they have made it so that kerchunk can't be performant. I might copy that collection to s3 just to show what's possible...

And it's definitely a throttling issue, since this works:

import time

for url in urls:
    gen_json(url)
    time.sleep(5)  # pause to stay under the NOMADS request-rate limit

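(A sketch of the natural follow-on step: combining the per-file reference JSONs into a single reference set with kerchunk's MultiZarrToZarr. The file glob pattern, output name, and concatenation dimension are assumptions.)

from glob import glob
import ujson
from kerchunk.combine import MultiZarrToZarr

# Combine the per-file references generated by the loop above.
mzz = MultiZarrToZarr(
    sorted(glob("*.json")),       # hypothetical pattern for the per-file JSONs
    remote_protocol="https",      # chunks still live on NOMADS over HTTPS
    concat_dims=["time"],         # assumed concatenation dimension
)
with open("nwm_medium_range.json", "wb") as f:
    f.write(ujson.dumps(mzz.translate()).encode())
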
martindurant commented 2 years ago

But if you open using kerchunked references, at least you skip all the HDF5 metadata lookups, so it might well be enough. Of course, S3 will provide better parallel performance, but you pay for that...
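
(For illustration, this is roughly how a user would open the data through the references; a minimal sketch assuming the combined JSON from the step above is hosted at a hypothetical throttle-free URL, with the chunk reads still going to NOMADS over HTTPS.)

import xarray as xr

# Opening via kerchunk references: one small JSON read replaces all of the
# HDF5 metadata lookups mentioned above.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "https://example.com/nwm_medium_range.json",  # hypothetical host
            "remote_protocol": "https",
        },
    },
)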