fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License
312 stars 80 forks source link

MultiZarrToZarr does not pass `target_options` to the reference filesystems #522

Open reweeden opened 1 month ago

reweeden commented 1 month ago

Hi there! I was trying to get MultiZarrToZarr to work on a set of files that have been gzip compressed. Fsspec's fsspec.open function has a compression parameter that can be passed to tell fsspec to decompress the files on the fly. I have gotten this to work using zarr files by doing something like this:

fsspec.filesystem(
    "reference",
    fo="s3://bucket/my-compressed-zarr-file.json.gz",
    target_options={'compression': 'gzip'},
    remote_protocol='s3',
    remote_options={
        'anon': False,
    },
)

I was hoping that this would also work with MultiZarrToZarr and be equally as simple as passing the target_options, however, it seems that while MultiZarrToZarr opens the files using the target_options here to get the file name: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L265

It then does not pass the target options to the fsspec.filesystem call here:

https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277-L283

Is this a bug or is there some reason why the target_options can't be passed to fsspec.filesystem there?

Analysis of Compressed files

From what I can tell by running this with a debugger, this is what's happening:

  1. The call to open_files here succeeds: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L265
  2. The call to fs.cat does not include the target_options so the file is not decompressed when read: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L270
  3. The json decoding fails because it's trying to decode the compressed data and the fo_list is set to the original list of filenames. https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L274
  4. The fsspec.filesystem call opens the files again, but since the target_options are not passed in, the data is not decompressed and the json decoding fails: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277

Solution

I think there are 2 additional places that the target options need to be passed in

  1. ~~In the fs.cat call via the **kwargs parameter: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L270~~

API Docs for fs.cat: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem Source code for fs.cat: https://github.com/fsspec/filesystem_spec/blob/4517882f67d635d50b54cd53fd04ee3a37b6943c/fsspec/spec.py#L844

EDIT: After trying this out, it seems that the s3fs implementation for cat_file doesn't work the way the syncronous abstract class does where the kwargs are passed to a call to fs.open. The s3fs _cat_file doesn't support kwargs at all: https://github.com/fsspec/s3fs/blob/f3f63cbfbfe71a4355abd63cafd8c678c4a5a0af/s3fs/core.py#L1113

  1. In the fs.filesystem call: https://github.com/fsspec/kerchunk/blob/dc66b2cd85ce170fbc0fbc652cc80f54439bd786/kerchunk/combine.py#L277

Workaround

I believe I can work around this by opening the files myself and passing in the zarr dictionaries directly. It's just more code for me to write :)

martindurant commented 1 month ago

OK, so target_options in your case is something that needs to go to open(), not something that the whole target filesystem has configured. Since the file in question was just opened correctly a couple of lines above, I'm not sure why it's being unbundled like this.

with of as f_list:
     fo_list = [ujson.loads(v) for v in f_list]
reweeden commented 1 month ago

I was looking through the repo history and it seems that in the past the of objects returned by fsspec.open were passed to the filesystem call directly. But then at some point it was changed to only pass the of.full_name attributes instead, causing the filesystem to have to re-open the files.

martindurant commented 2 weeks ago

Please do make a PR to pass the options for now. We can consider whether it's possible to fuse the open() calls.