fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

"OSError: [Errno 24] Too many open files" when using `MultiZarrToZarr` #214

Open leifdenby opened 1 year ago

leifdenby commented 1 year ago

First, thanks for a great project!

I work with output from a Large-Eddy Simulation model (UCLA-LES) which decomposes the 3D spatial domain into horizontally separated columns so that each MPI core only handles the fluid evolution within a small horizontal subdomain and each core writes a separate netCDF file for that subdomain. This means that once the simulation is done I need to combine these individual column-files in some way.

Using kerchunk I've been writing a little command-line utility (https://github.com/leifdenby/uclales-zarr) to produce a zarr-archive (JSON) file that describes the whole simulation output without the need to merge all these files into a single netCDF file. I've gotten this working when I have a small number of source files (a 10x10 grid of columns, so 100 source netCDF files) and it works great to later load in with xarray(!) But with a larger simulation with 64x64=4096 columns, an OSError: Too many open files exception is raised. Have others come across this issue? Am I missing something obvious? I had a quick glance inside MultiZarrToZarr.fss in kerchunk (https://github.com/fsspec/kerchunk/blob/main/kerchunk/combine.py#L147) and it looks like the JSON files are simply looped over. Is there maybe a way to ensure the files are closed here while looping through? Stacktrace below.

Sorry if this isn't the intended use of kerchunk :innocent:, I thought I'd give it a try and was wondering if there's an easy fix :smile:

Thanks again!

PS. Thanks @martindurant for adding netCDF3 support recently. Some of my model's output files are written in netCDF3 format. I'm in the process of changing that, but we have quite a lot of old simulations lying around that use netCDF3, so it's nice not to have to convert to netCDF4 first.

~/datastore/a289/LES_datasets/uclales » python -m uclales_zarr rico_raw_data_2048 rico_gcss --data-kind 3d                                                                   
2022-08-22 07:43:39.865 | INFO     | uclales_zarr.uclales_zarr:_find_source_files:117 - Found 4096 soure files
2022-08-22 07:43:39.921 | INFO     | uclales_zarr.uclales_zarr:_create_singlefile_zarr_jsons:40 - Creating JSON file for each individual source NetCDF file
[########################################] | 100% Completed | 23.60 s
2022-08-22 07:44:03.857 | INFO     | uclales_zarr.uclales_zarr:_multizarr_to_zarr:81 - Writing single-file json zarr descriptor to `rico_raw_data_2048__zarr/rico_gcss.json`
OSError(24, 'Too many open files')
Traceback (most recent call last):
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/ipdb/__main__.py", line 222, in launch_ipdb_on_exception
  File "/nfs/see-fs-02_users/earlcd/git-repos/uclales-zarr/uclales_zarr/__main__.py", line 18, in <module>
  File "/nfs/see-fs-02_users/earlcd/git-repos/uclales-zarr/uclales_zarr/uclales_zarr.py", line 157, in main
  File "/nfs/see-fs-02_users/earlcd/git-repos/uclales-zarr/uclales_zarr/uclales_zarr.py", line 90, in _multizarr_to_zarr
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/kerchunk/combine.py", line 445, in translate
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/kerchunk/combine.py", line 226, in first_pass
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/kerchunk/combine.py", line 149, in fss
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/core.py", line 141, in open
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/core.py", line 104, in __enter__
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/spec.py", line 1037, in open
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/implementations/local.py", line 159, in _open
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/implementations/local.py", line 254, in __init__
  File "/nfs/see-fs-02_users/earlcd/miniconda3/envs/kerchunk/lib/python3.8/site-packages/fsspec/implementations/local.py", line 259, in _open
OSError: [Errno 24] Too many open files: '/nfs/a289/earlcd/LES_datasets/uclales/rico_raw_data_2048__zarr/src_jsons__3d/rico_gcss.00150057.nc.json'
martindurant commented 1 year ago

Your traceback doesn't quite line up with the current state of the code. Would you mind trying to find exactly which line in kerchunk.combine caused this? Note that on Linux and Mac, it is easy to change the allowed number of open files with ulimit; the default is rather small, particularly on Mac. 4096 is indeed a lot of input JSON files, but hardly extreme.
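
For reference, the same limit that ulimit reports can be inspected from inside Python with the standard-library resource module (Unix only). The following is a minimal sketch, not part of kerchunk; the helper name raise_soft_limit is hypothetical. It raises the soft limit up to the hard limit, which does not require root access, and leaves the limits unchanged if the platform rejects the request.

```python
import resource


def raise_soft_limit():
    """Try to raise the soft open-file limit to the hard limit.

    Returns the (soft, hard) pair in effect afterwards. Hypothetical helper
    for illustration; it only uses the standard-library resource module.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms reject the hard value; keep the current limits.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_soft_limit())
```

This only helps when the hard limit is comfortably larger than the number of files that need to be open at once; the hard limit itself can only be raised by a privileged user.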

leifdenby commented 1 year ago

The entry in my code that calls MultiZarrToZarr.translate(...) is here: https://github.com/leifdenby/uclales-zarr/blob/master/uclales_zarr/uclales_zarr.py#L90. The kerchunk and fsspec versions are:

kerchunk==0.0.7
fsspec==2022.5.0
fsspec-reference-maker==0.0.4

Looking at the kerchunk==0.0.7 source tree this seems to be where kerchunk calls fsspec: https://github.com/fsspec/kerchunk/blob/0.0.7/kerchunk/combine.py#L149

The hard limit on number of open files on the system I'm using is 4096.

$> ulimit -Hn
4096

And I don't have root access so I can't increase above this value.

Shouldn't it be possible to avoid having all the json-files open at once and instead open/close them one-by-one?

keewis commented 1 year ago

You can work around this by manually opening and JSON-decoding the files, and then passing the resulting list of dicts to MultiZarrToZarr:

import fsspec
import ujson
from kerchunk.combine import MultiZarrToZarr
fs, _ = fsspec.core.url_to_fs(urls[0], **target_options)
refs = [ujson.loads(c) for c in fs.cat(urls).values()]
mzz = MultiZarrToZarr(refs, ...)

The version on main should not have this issue anymore since it uses fs.cat (assuming fsspec.open_files does not actually open the local files) so we'd just have to wait on the next release.
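
For local files, the same idea can be sketched with only the standard library, assuming the per-file reference JSONs are on a local filesystem. The helper name load_reference_sets is hypothetical, and json is used here in place of ujson. Each file is opened inside a with block, so at most one file descriptor is in use at any time, regardless of how many input files there are.

```python
import json


def load_reference_sets(paths):
    """Decode each reference JSON file in turn, one open file at a time.

    Hypothetical helper for illustration; `paths` is assumed to be an
    iterable of local file paths to single-file reference JSONs.
    """
    refs = []
    for path in paths:
        # The file is closed as soon as the with block exits,
        # before the next file is opened.
        with open(path, "r") as f:
            refs.append(json.load(f))
    return refs
```

The returned list of dicts could then be passed to MultiZarrToZarr in place of the list of paths, as in the workaround above.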

lsterzinger commented 1 year ago

If you don't want to wait for the next release, pip can install directly from the main branch on GitHub with pip install git+https://github.com/fsspec/kerchunk