Open rsignell-usgs opened 2 years ago
Oh no, it's fractal! The mention of "heap" suggests to me there may be some variable-length structures like strings.
Can you run the failing file on its own, without first running the other files? Does it always fail at the same point?
@martindurant, very good idea! It was dying on the 14th file, so I tried another range and it died on the 14th file again -- so it has nothing to do with the files! That sounded like throttling, so I googled and I found this: https://luckgrib.com/blog/2021/04/19/throttling.html which says that NOMADS limits requests to 120/minute. I guess this is not good news for using kerchunk, since there are 480 files in the collection. :(
I guess this is not good news for using kerchunk
You could phrase this either way. I would say that we do have an excellent way to get all the metadata (slowly) and present it at a location not subject to throttling; and thereafter the user can successfully grab subsections of the data very well. Of course, you can't do anything massively parallel here.
@martindurant right -- it's not that kerchunk doesn't work well here, it's that by throttling they have made it so that kerchunk can't be performant. I might copy that collection to s3 just to show what's possible....
and it's definitely a throttling issue, since this works:
```python
import time

for url in urls:
    gen_json(url)    # build the kerchunk reference JSON for this file
    time.sleep(5)    # pause between requests to stay under the NOMADS throttle
```
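A fixed 5-second sleep stays well under the 120 requests/minute limit, at the cost of sleeping even when it isn't necessary. A slightly more adaptive alternative is a small client-side rate limiter (a hypothetical helper, not part of kerchunk or fsspec) that only sleeps when the request window is actually full:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` calls per sliding window of `period` seconds."""

    def __init__(self, max_calls=120, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # monotonic timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep just long enough for the oldest call to age out.
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Usage would be `limiter = RateLimiter(); limiter.wait(); gen_json(url)` inside the loop, so back-to-back cheap requests run at full speed and the throttle only kicks in near the limit.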
But if you open using kerchunked references, at least you skip all the HDF5 metadata lookups, so it might well be enough. Of course, S3 will provide better parallel performance, but you pay for that...
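For context on why the kerchunked path avoids those metadata lookups: a reference file is just JSON mapping each chunk key to a (url, offset, length) triple, so a reader can issue one ranged GET per chunk instead of walking the HDF5 metadata over HTTP. A sketch of the version-1 reference format, with a hypothetical URL and made-up offsets:

```python
import json

# Illustrative kerchunk-style reference set (version 1 format).
# Each chunk key maps to [url, byte_offset, length]; inline metadata
# (like .zgroup) is stored directly as a string.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temperature/0.0": [
            "https://nomads.example/file14.nc",  # hypothetical URL
            40960,    # byte offset of the chunk within the remote file
            123456,   # number of bytes to read
        ],
    },
}
reference_json = json.dumps(refs, indent=2)
```

A reader backed by such a file only touches the remote server for the chunk bytes it actually needs, which is why throttling hurts the (one-time) reference generation far more than subsequent reads.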
I'm trying to kerchunk the medium range netcdf files from the National Water Model, which are only accessible via HTTPS from NOMADS (not S3).
I can successfully process a few files, and then it bombs out with a cryptic error:
Do you think this is on the provider side or on our end?
Here's a notebook that should reproduce the issue: https://nbviewer.org/gist/rsignell-usgs/2af84b60301a9c7c9a46aaef2b84d2fa