fsspec / filesystem_spec

A specification that Python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

LazyReference: Passing options to `.open` of the reference file #1746

Open wachsylon opened 1 week ago

wachsylon commented 1 week ago

I created references with kerchunk. Because I am short on storage space, I was thinking of just compressing the files which are referenced with lz4. My question is: could I leave the references as they are and just pass the `compression="infer"` option somewhere in the reference file system, so that it is used when opening the referenced files? I guess it would make sense to cache the uncompressed referenced files so that they can be reused when multiple chunks live in one referenced file. Does that work somehow?

If this is not feasible, could we kerchunk files that are compressed with lz4 or zstandard, similar to zip and tar? As far as I understood, cat_ranges is also possible on the zstandard-compressed parquet tables.

If both approaches are possible, what would you recommend?

wachsylon commented 1 week ago

By the way, I tried to pass this option with

import fsspec

cd = {"compression": "infer"}
fs = fsspec.filesystem(
    "reference",
    fo="test.parq",
    target_options=cd,
    remote_options=cd,
    storage_options=cd,
    **cd,
)

but none of these settings makes a difference. I added a print for the kwargs in the cat_ranges function, and the dict is empty.
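For comparison, the option the reporter is reaching for does exist on plain filesystems: fsspec's `open()` accepts a `compression` argument, with `"infer"` deducing the codec from the file suffix. A minimal sketch using the in-memory backend as a stand-in for a real remote store (the reference filesystem fetches bytes differently, which is the crux of the thread):

```python
import fsspec
import gzip

# Write gzip-compressed bytes to an in-memory "file".
m = fsspec.filesystem("memory")
with m.open("/data.gz", "wb") as f:
    f.write(gzip.compress(b"payload"))

# open() with compression="infer" decompresses transparently,
# based on the .gz suffix.
with m.open("/data.gz", "rb", compression="infer") as f:
    restored = f.read()

print(restored)  # b'payload'
```

This keyword lives on `open()`, so whether it can help here depends on whether the reference filesystem goes through `open()` at all when resolving references.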

martindurant commented 1 week ago

If you are using parquet storage, the references are already compressed internally by the design of the parquet file format. It uses the Zstd algorithm, which appears to be a good choice for this kind of data. Further compression would not be useful.

wachsylon commented 1 week ago

I think there is a misunderstanding. Let me try to be clearer.

Originally: File1 <- reference table. Anywhere in the reference file system, when accessing the data through references: fs.open(File1)

After compression: File1compr <- reference table

where File1compr has the same name as before and would also contain the same chunks after decompression.

Why can't I pass options to fs.open? Like fs.open(file1compr, **kwargs) with kwargs=dict(compression="infer"). I would like to leave the table as it is.

martindurant commented 1 week ago

The question is: is this parquet? If yes, I don't think there's a code path to add arguments to open(); but, again, I am doubtful that compression gains you much in this case.

wachsylon commented 2 days ago

Whether it is Parquet or JSON should not be relevant to my issue. It is about the files that are referenced inside the JSONs or Parquets, in the path column. I gain storage space if I compress these referenced files.

I thought I had to specify remote_options to pass something to the filesystem that is used to work with the referenced files. But it seems these options are not passed through correctly.

One related thing that I find suspicious is that there are functions which accept **kwargs and call other functions that also accept **kwargs, but do not pass them on. E.g.:

https://github.com/fsspec/filesystem_spec/blob/9a161714f0bbfe44ee769f259420f2f7db975471/fsspec/spec.py#L836
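The pattern being described can be sketched generically (hypothetical function names, not actual fsspec code):

```python
# Hypothetical functions illustrating the pattern, not fsspec internals.
def inner(**kwargs):
    return kwargs

def swallows(**kwargs):
    # Accepts **kwargs but does not forward them -- they are silently lost.
    return inner()

def forwards(**kwargs):
    # Correctly propagates the keyword arguments down the call chain.
    return inner(**kwargs)

print(swallows(compression="infer"))  # {}
print(forwards(compression="infer"))  # {'compression': 'infer'}
```

Any option passed into the outer call of the first kind never reaches the inner function, which would match the empty kwargs dict observed in cat_ranges.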

martindurant commented 2 days ago

the files that are referenced inside

Ah, sorry I misunderstood you. In typical use with zarr, the compression of data blocks is handled by zarr not fsspec, which is why this idea didn't come up before. The storage_options are indeed used to configure the filesystem with which to get the contents of each reference, and not used in open(); in fact, it uses cat/cat_ranges, which has no compression option at all.

Is your use case zarr? Then you could add the lz4 codec to your .zarray spec. However, it will only work if the compression is applied per block, not to a file containing many blocks. This is because lz4 (and other compressors) do not support random access within a compressed stream: when you open a compressed file and seek(), you actually have to stream through the data to reach the requested location.
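That last distinction can be illustrated with the stdlib (zlib as a stand-in for lz4, which has the same limitation regarding random access):

```python
import zlib

# Five 1000-byte "blocks", standing in for chunks in a referenced file.
chunks = [bytes([i]) * 1000 for i in range(5)]

# Whole-file compression: one stream covering everything. There is no
# random access -- to reach block 3 we must inflate blocks 0-2 as well.
whole = zlib.compress(b"".join(chunks))
block3_via_whole = zlib.decompress(whole)[3000:4000]

# Per-block compression: each block is its own stream, so a reference
# (pointing at one compressed block) can still be resolved independently.
per_block = [zlib.compress(c) for c in chunks]
block3_via_block = zlib.decompress(per_block[3])

print(block3_via_whole == block3_via_block)  # True
```

Per-block compression is exactly what a codec entry in the .zarray spec gives you, since zarr decompresses each chunk on its own.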