wachsylon opened 1 week ago
btw, I tried to pass this option with

```python
import fsspec

cd = {"compression": "infer"}
fs = fsspec.filesystem(
    "reference",
    fo="test.parq",
    target_options=cd,
    remote_options=cd,
    storage_options=cd,
    **cd,
)
```

but none of these settings makes a difference. I added a print for the kwargs in the `cat_ranges` function, and the dict is empty.
If you are using parquet storage, the references are already compressed internally by the design of the parquet file format. It uses the Zstd algorithm, which appears to be a good choice for this kind of data. Further compression would not be useful.
I think there is a misunderstanding; let me try to be clearer.

Originally:

```
File1 <- reference-table
```

Anywhere in the reference file system, when accessing the data through references:

```python
fs.open(File1)
```

After compression:

```
File1compr <- reference-table
```

where `File1compr` has the same name as before and would also have the same chunks after decompressing. Why can't I pass options to `fs.open`? Like

```python
fs.open(file1compr, **kwargs)
```

with `kwargs = dict(compression="infer")`. I would like to leave the table as it is.
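For comparison, this already works on a plain (non-reference) filesystem; a minimal sketch, assuming the `lz4` package is installed and a hypothetical `file1.lz4`:

```python
import fsspec

# compression="infer" picks the codec from the file extension (.lz4 here),
# so the content is transparently decompressed on read
with fsspec.open("file1.lz4", mode="rb", compression="infer") as f:
    data = f.read()
```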
The question is: is this parquet? If yes, I don't think there's a code path to add arguments to `open()`; but, again, I am doubtful that compression gains you much for this case.
Parquet or JSON should not be relevant to my issue. It is about the files that are referenced inside the JSONs or parquets, in the `path` column. I gain storage space if I compress these referenced files.

I thought I had to specify `remote_options` to pass something to the `fs` that is used to work with those files, but it seems these options are not passed on correctly.
One related thing that I find suspicious is that there are functions that accept `**kwargs` and which call other functions that also accept `**kwargs`, but do not pass them on when calling them.
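E.g., schematically (a made-up sketch to show the pattern, not the actual fsspec code):

```python
def cat_ranges(paths, starts, ends, **kwargs):
    # kwargs (e.g. compression="infer") are accepted here ...
    return [_fetch_range(p, s, e) for p, s, e in zip(paths, starts, ends)]

def _fetch_range(path, start, end, **kwargs):
    # ... but never arrive here, because the caller drops them
    ...
```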
> the files that are referenced inside
Ah, sorry, I misunderstood you. In typical use with zarr, the compression of data blocks is handled by zarr, not fsspec, which is why this idea didn't come up before. The `storage_options` are indeed used to configure the filesystem with which to get the contents of each reference, and are not used in `open()`; in fact, the reference filesystem uses `cat`/`cat_ranges`, which has no compression option at all.
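To illustrate (a minimal sketch, with a hypothetical `refs.json` pointing at anonymous s3 data):

```python
import fsspec

# remote_options configure the filesystem that fetches the reference
# targets (s3 here); they are not forwarded per call to cat_ranges/open()
fs = fsspec.filesystem(
    "reference",
    fo="refs.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
```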
Is your use case zarr? Then you could add the lz4 codec to your `.zarray` spec. However, it will only work if the compression is for whole blocks, not for a file containing several blocks. This is because lz4 (and other compressors) do not support random access within the compressed stream: when you open a file with compression and `seek()`, you actually need to stream through the data to get to the requested location.
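For example, if each reference covers exactly one lz4-compressed block, the chunk codec could be declared roughly like this (a sketch of a zarr v2 `.zarray`; the exact codec config depends on your numcodecs version):

```python
import json

# numcodecs registers lz4 under the codec id "lz4"; declaring it as the
# compressor makes zarr decompress each chunk after fsspec fetches it
zarray = {
    "shape": [1000, 1000],  # hypothetical array shape
    "chunks": [100, 100],   # hypothetical chunking
    "dtype": "<f8",
    "compressor": {"id": "lz4", "acceleration": 1},
    "filters": None,
    "fill_value": None,
    "order": "C",
    "zarr_format": 2,
}
print(json.dumps(zarray, indent=2))
```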
I created the references with kerchunk. Because of limited storage space, I was thinking of just compressing the files which are referenced, using lz4. My question is: could I leave the references as they are and just pass the `compression="infer"` option somewhere into the reference file system, so that it is used for opening the referenced files? I guess it would make sense to cache the uncompressed referenced files so that they can be reused if multiple chunks are in one reference file. Does that work somehow?

If this is not feasible, could we kerchunk files that are compressed with lz4 or zstandard, similar to zip and tar? As far as I understood, `cat_ranges` is also possible on the zstandard-compressed parquet tables.

If both approaches are possible, what would you recommend?