Open leifdenby opened 1 year ago
Your traceback doesn't quite line up with the current state of the code. Would you mind trying to find exactly which line in kerchunk.combine caused this? Note that on Linux and Mac, it is easy to change the allowed number of open files with ulimit, which is by default rather small, particularly on Mac. 4096 is indeed a lot of input JSON files, but hardly extreme.
The entry in my code that calls MultiZarrToZarr.translate(...) is here: https://github.com/leifdenby/uclales-zarr/blob/master/uclales_zarr/uclales_zarr.py#L90. The kerchunk and fsspec versions are:
kerchunk==0.0.7
fsspec==2022.5.0
fsspec-reference-maker==0.0.4
Looking at the kerchunk==0.0.7 source tree, this seems to be where kerchunk calls fsspec: https://github.com/fsspec/kerchunk/blob/0.0.7/kerchunk/combine.py#L149
The hard limit on number of open files on the system I'm using is 4096.
$> ulimit -Hn
4096
And I don't have root access so I can't increase above this value.
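For reference, the same limits can be inspected from inside Python with the standard-library resource module (Unix only). This is a minimal sketch, not part of kerchunk; the helper name raise_soft_nofile_limit is hypothetical. It raises the soft limit up to the hard limit, which does not need root, but it cannot go past the hard limit, so it would not help when the hard limit of 4096 is itself the bottleneck as described above.

```python
import resource


def raise_soft_nofile_limit():
    """Raise the soft open-file limit to the hard limit (hypothetical helper).

    Returns the (soft, hard) pair in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Only touch the limit when the hard limit is finite and not yet reached;
    # an unprivileged process may raise its soft limit up to the hard limit.
    if hard != resource.RLIM_INFINITY and soft != hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


new_soft, new_hard = raise_soft_nofile_limit()
print("soft limit:", new_soft, "hard limit:", new_hard)
```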
Shouldn't it be possible to avoid having all the json-files open at once and instead open/close them one-by-one?
You can work around this by manually opening and JSON-decoding the files, and then passing the resulting list of dicts to MultiZarrToZarr:
import fsspec
import ujson
from kerchunk.combine import MultiZarrToZarr

fs, _ = fsspec.core.url_to_fs(urls[0], **target_options)
refs = [ujson.loads(c) for c in fs.cat(urls).values()]
mzz = MultiZarrToZarr(refs, ...)
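If the reference files are all local, the same idea can be sketched with only the standard library, opening and closing one file at a time so that a single file descriptor is in use at any moment. This is an illustrative sketch under assumptions: the helper name load_refs_one_by_one and the demo file names are hypothetical, and json stands in for ujson.

```python
import json
import os
import tempfile


def load_refs_one_by_one(paths):
    """Decode each reference JSON file in turn (hypothetical helper).

    The with-block closes every file before the next one is opened,
    so the number of simultaneously open files stays at one.
    """
    refs = []
    for path in paths:
        with open(path) as f:
            refs.append(json.load(f))
    return refs


# Small demo with three dummy reference files in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_paths = []
    for i in range(3):
        path = os.path.join(tmpdir, f"column_{i}.json")
        with open(path, "w") as f:
            json.dump({"version": 1, "refs": {}}, f)
        demo_paths.append(path)
    demo_refs = load_refs_one_by_one(demo_paths)

print(len(demo_refs), "reference sets loaded")
```

The resulting list of dicts could then be passed to MultiZarrToZarr in the same way as refs in the workaround above.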
The version on main should not have this issue anymore since it uses fs.cat (assuming fsspec.open_files does not actually open the local files), so we'd just have to wait on the next release.
If you don't want to wait for the next release, pip can install from the main branch on github directly with pip install git+https://github.com/fsspec/kerchunk
First, thanks for a great project!
I work with output from a Large-Eddy Simulation model (UCLA-LES) which decomposes the 3D spatial domain into horizontally separated columns so that each MPI core only handles the fluid evolution within a small horizontal subdomain and each core writes a separate netCDF file for that subdomain. This means that once the simulation is done I need to combine these individual column-files in some way.
Using kerchunk I've been writing a little command line utility (https://github.com/leifdenby/uclales-zarr) to produce a zarr-archive (json) file that describes the whole simulation output without the need to merge all these files into a single netCDF file. I've gotten this working when I have a small number of source files (a 10x10 grid of columns, so 100 source netCDF files) and it works great to later load in with xarray(!) But with a larger simulation with 64x64=4096 columns I get an OSError: Too many open files exception being raised. Have others come across this issue? Am I missing something obvious? I had a quick glance inside MultiZarrToZarr.fss inside of kerchunk (https://github.com/fsspec/kerchunk/blob/main/kerchunk/combine.py#L147) and it looks like the json files are simply looped over. Is there maybe a way to ensure the files are closed here while looping through? Stacktrace below.
Sorry if this isn't the intended use of kerchunk :innocent:, I thought I'd give it a try and was wondering if there's an easy fix :smile:
Thanks again!
PS. Thanks @martindurant for adding netCDF3 support recently. Some of my models' output files are written in netCDF3 format. I'm in the process of changing that, but we have quite a lot of old simulations lying around that use netCDF3, so it's nice not to have to convert to netCDF4 first.