Open Phlair opened 3 years ago
Hi, is there an update for this issue? I'm especially interested in having zip compression support for this source connector.
+1 to this goal, specifically for support of the snappy compression. The lack of support there is blocking users I have been collaborating with.
@lazebnyi Can you please scope this issue, especially support for snappy compression? cc @davydov-d as well. I'd like to understand the LoE involved here, please 🙏
Tell us about the problem you're trying to solve
Currently S3 (via abstract files source) only supports gzip and bzip2 compressions.
Describe the solution you’d like
Build in support for other compression formats such as:
- Zip
- Lzma
- Xz
- Snappy
This should extend to all supported file formats (where appropriate) and sit within abstract-files-source so that any new file storage source built on top of this inherits the compression support.
If that link is broken we may have refactored where that code sits and I forgot to update this, comment @ me and I'll fix them!
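To make the request above more concrete, here is a minimal sketch of what suffix-based decompression could look like at the abstract-files level. It uses only the Python standard library and is not the connector's actual code: the function name `open_decompressed` and the `key` parameter are hypothetical, and picking the codec from the file suffix is an assumption. Snappy is left out because the standard library has no snappy module, so it would need a third-party package.

```python
import bz2
import gzip
import io
import lzma
import zipfile
from pathlib import PurePosixPath
from typing import BinaryIO


def open_decompressed(raw: BinaryIO, key: str) -> BinaryIO:
    """Wrap `raw` in a decompressing reader chosen from the suffix of `key`.

    Hypothetical helper: unknown suffixes are passed through unchanged.
    """
    suffix = PurePosixPath(key).suffix.lower()
    if suffix == ".gz":
        return gzip.GzipFile(fileobj=raw, mode="rb")
    if suffix == ".bz2":
        return bz2.BZ2File(raw, mode="rb")
    if suffix in (".xz", ".lzma"):
        # The default format (lzma.FORMAT_AUTO) detects both .xz and
        # legacy .lzma containers when reading.
        return lzma.LZMAFile(raw, mode="rb")
    if suffix == ".zip":
        # zipfile needs a seekable stream; a zip may hold several members,
        # so this sketch only accepts archives with exactly one file.
        archive = zipfile.ZipFile(raw)
        members = [n for n in archive.namelist() if not n.endswith("/")]
        if len(members) != 1:
            raise ValueError(f"expected one file in {key}, found {len(members)}")
        return archive.open(members[0])
    return raw
```

Because the helper returns a file-like object, any format parser sitting on top of it would not need to know which compression was used, which is the inheritance property the request asks for.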
@YowanR do we need all these compression types for all the supported formats? If yes, the level of effort would be high.
I did some quick research and found that we currently support 4 formats (`avro`, `jsonl`, `parquet` and `csv`). All of them except `avro` are backed by the `pyarrow` library. It supports reading `bzip2` and `gzip` by default. It also supports `brotli`, `lz4`, `snappy` and `zstd`, but as far as I understand we'll need to do some work to integrate these compression types. I did not find any mention in the docs of regular `zip`, `lzma` or `xz` support. As for the `avro` format, it is backed by the `fastavro` library, and its docs say that it supports `snappy`, `deflate`, `zstandard`, `bzip2`, `lz4` and `xz`. So I think we'd better decompose this task. Here's my suggestion based on the LoE:
- `snappy` and `xz` for `avro` -- only need to verify things work by default
- `snappy` for other file formats -- need to do some coding but use existing tools
- `zip`, `lzma`, `xz` (except for `avro`) -- need to find or implement new solutions

To add more context to my previous note, the user required snappy for Parquet files.
Thank you Denys for the additional scoping here.