Open reweeden opened 1 month ago
OK, so target_options in your case is something that needs to go to open(), not something that the whole target filesystem has configured. Since the file in question was just opened correctly a couple of lines above, I'm not sure why it's being unbundled like this.
with of as f_list:
    fo_list = [ujson.loads(v) for v in f_list]
I was looking through the repo history, and it seems that in the past the `of` objects returned by `fsspec.open` were passed to the filesystem call directly. At some point this was changed to pass only the `of.full_name` attributes instead, so the filesystem has to re-open the files.
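The re-open problem described above can be reproduced without kerchunk. The following is a minimal stdlib sketch of the failure mode, not kerchunk's actual code: `gzip.open` stands in for `fsspec.open(..., compression="gzip")`, a plain `open` stands in for the second read by file name only, and the reference dict is toy data made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Toy reference set (hypothetical content, for illustration only).
refs = {"version": 1, "refs": {".zgroup": '{"zarr_format": 2}'}}

# Write a gzip-compressed reference file, as in the issue.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "refs.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(refs, f)

# First open: compression is applied, so the JSON decodes.
with gzip.open(path, "rb") as f:
    first_read = json.loads(f.read())

# Second open by name only: the compression option is lost, so the
# raw gzip bytes reach the JSON decoder and decoding fails.
with open(path, "rb") as f:
    raw = f.read()
try:
    json.loads(raw)
    second_read_failed = False
except ValueError:  # parent of UnicodeDecodeError and json.JSONDecodeError
    second_read_failed = True
```

Only the file name survives between the two reads, so the option that made the first read work never reaches the second one.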
Please do make a PR to pass the options for now. We can consider whether it's possible to fuse the open() calls.
Hi there! I was trying to get `MultiZarrToZarr` to work on a set of files that have been gzip-compressed. fsspec's `fsspec.open` function has a `compression` parameter that can be passed to tell `fsspec` to decompress the files on the fly. I have gotten this to work with zarr files.

I was hoping that this would also work with `MultiZarrToZarr` and be as simple as passing the `target_options`. However, it seems that `MultiZarrToZarr` opens the files using the `target_options` here to get the file name: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L265

It then does not pass the target options to the `fsspec.filesystem` call here: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277-L283

Is this a bug, or is there some reason why the `target_options` can't be passed to `fsspec.filesystem` there?

Analysis of Compressed files
From what I can tell by running this with a debugger, this is what's happening:
1. `open_files` here succeeds: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L265
2. `fs.cat` does not include the `target_options`, so the file is not decompressed when read: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L270
3. `fo_list` is set to the original list of filenames: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L274
4. The `fsspec.filesystem` call opens the files again, but since the `target_options` are not passed in, the data is not decompressed and the JSON decoding fails: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277

Solution
I think there are 2 additional places where the target options need to be passed in:

1. ~~The `fs.cat` call via the `**kwargs` parameter: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L270~~
   - API docs for `fs.cat`: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem
   - Source code for `fs.cat`: https://github.com/fsspec/filesystem_spec/blob/4517882f67d635d50b54cd53fd04ee3a37b6943c/fsspec/spec.py#L844

   EDIT: After trying this out, it seems that the `s3fs` implementation of `cat_file` doesn't work the way the synchronous abstract class does, where the `kwargs` are passed to a call to `fs.open`. The s3fs `_cat_file` doesn't support kwargs at all: https://github.com/fsspec/s3fs/blob/f3f63cbfbfe71a4355abd63cafd8c678c4a5a0af/s3fs/core.py#L1113
2. The `fsspec.filesystem` call: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277

Workaround
I believe I can work around this by opening the files myself and passing in the zarr dictionaries directly. It's just more code for me to write :)
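A rough sketch of that workaround, under a few assumptions: the reference files are local, gzip-compressed JSON, and the helper name `load_compressed_refs` and the toy reference contents are hypothetical. The stdlib `gzip.open` is used here so the example is self-contained; for remote files, `fsspec.open(url, compression="gzip")` would play the same role.

```python
import gzip
import json
import os
import tempfile


def load_compressed_refs(paths):
    """Read gzip-compressed reference JSON files into a list of dicts."""
    ref_dicts = []
    for p in paths:
        # gzip.open decompresses on the fly; "rt" yields decoded text.
        with gzip.open(p, "rt", encoding="utf-8") as f:
            ref_dicts.append(json.load(f))
    return ref_dicts


# Build two small compressed reference files to exercise the helper
# (toy data, not real kerchunk output).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"refs_{i}.json.gz")
    with gzip.open(p, "wt", encoding="utf-8") as f:
        json.dump({"version": 1, "refs": {"time/0": f"chunk-{i}"}}, f)
    paths.append(p)

ref_dicts = load_compressed_refs(paths)
```

The resulting `ref_dicts` list is what would be handed to `MultiZarrToZarr` in place of the file names, so that it never has to open the compressed files itself.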