airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other

Support more compressions in S3 Source #5110

Open Phlair opened 3 years ago

Phlair commented 3 years ago

Tell us about the problem you're trying to solve

Currently S3 (via abstract files source) only supports gzip and bzip2 compressions.

Describe the solution you’d like

Build in support for other compression such as:

  • Zip
  • Lzma
  • Xz
  • Snappy

This should extend to all supported file formats (where appropriate) and sit within abstract-files-source so that any new file storage source built on top of this inherits the compression support.

If that link is broken, we may have refactored where that code sits and I forgot to update this. Comment @ me and I'll fix it!
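
For illustration only, here is a minimal sketch of how a shared layer could pick a decompressor from the file suffix, so that every source built on top of it inherits the behaviour. It uses only the Python standard library, which covers gzip, bzip2, xz, lzma and zip; snappy is not in the standard library and is left out. The names `DECOMPRESSORS`, `_open_zip` and `open_decompressed` are hypothetical and are not taken from the Airbyte codebase.

```python
import bz2
import gzip
import io
import lzma
import zipfile
from typing import BinaryIO, Callable, Dict


def _open_zip(raw: BinaryIO) -> BinaryIO:
    """Return a stream over the first file in a zip archive.

    Assumption: one data file per archive. zipfile needs a seekable
    stream, so a remote object may have to be buffered first.
    """
    archive = zipfile.ZipFile(raw)
    names = [n for n in archive.namelist() if not n.endswith("/")]
    if not names:
        raise ValueError("zip archive contains no files")
    return archive.open(names[0])


# Hypothetical registry: file suffix -> wrapper around a raw binary stream.
DECOMPRESSORS: Dict[str, Callable[[BinaryIO], BinaryIO]] = {
    ".gz": lambda raw: gzip.GzipFile(fileobj=raw, mode="rb"),
    ".bz2": lambda raw: bz2.BZ2File(raw),
    ".xz": lambda raw: lzma.LZMAFile(raw, format=lzma.FORMAT_XZ),
    ".lzma": lambda raw: lzma.LZMAFile(raw, format=lzma.FORMAT_ALONE),
    ".zip": _open_zip,
}


def open_decompressed(key: str, raw: BinaryIO) -> BinaryIO:
    """Wrap `raw` in a decompressor chosen from the suffix of `key`.

    Keys with an unknown suffix are returned unchanged.
    """
    lowered = key.lower()
    for suffix, wrap in DECOMPRESSORS.items():
        if lowered.endswith(suffix):
            return wrap(raw)
    return raw
```

A format parser would then read from `open_decompressed(key, stream)` instead of the raw stream. Choosing the codec by suffix is an assumption of this sketch; sniffing magic bytes would be more robust for misnamed files.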

darian-heede commented 2 years ago

Hi, is there an update for this issue? I'm especially interested in having zip compression support for this source connector.

ryjabe commented 1 year ago

+1 to this goal, specifically for support of the snappy compression. The lack of support there is blocking users I have been collaborating with.

YowanR commented 1 year ago

@lazebnyi Can you please scope this issue, especially support for snappy compression? cc @davydov-d as well. I'd like to understand the LoE involved here, please 🙏

davydov-d commented 1 year ago

Tell us about the problem you're trying to solve

Currently S3 (via abstract files source) only supports gzip and bzip2 compressions.

Describe the solution you’d like

Build in support for other compression such as:

  • Zip
  • Lzma
  • Xz
  • Snappy

This should extend to all supported file formats (where appropriate) and sit within abstract-files-source so that any new file storage source built on top of this inherits the compression support.

If that link is broken, we may have refactored where that code sits and I forgot to update this. Comment @ me and I'll fix it!

@YowanR do we need all these compression types for all the supported formats? If yes, the level of effort will be high. I did some quick research and found that we currently support 4 formats (avro, jsonl, parquet and csv). All of them except avro are backed by the pyarrow library. It supports reading bzip2 and gzip by default. It also supports brotli, lz4, snappy and zstd, but as far as I understand we'll need to do some work to integrate these compression types. I did not find any mention in the docs of regular zip, lzma or xz support. As for the avro format, it is backed by the fastavro library, and its docs say that it supports snappy, deflate, zstandard, bzip2, lz4 and xz. So I think we'd better decompose this task. Here's my suggestion based on the LoE:

  1. Support snappy and xz for avro -- only need to verify things work by default
  2. Support snappy for other file formats -- need to do some coding but use existing tools.
  3. Support zip, lzma, xz (except for avro) -- need to find or implement new solutions
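
The breakdown above can be sketched as a small lookup table, purely as an illustration. The codec sets below are copied from this comment and have not been independently verified against the pyarrow or fastavro docs; the names `SUPPORT`, `PYARROW_FORMATS` and `effort`, and the three tier labels, are hypothetical.

```python
from typing import Dict, Set

# Formats that this comment says are backed by pyarrow; avro uses fastavro.
PYARROW_FORMATS: Set[str] = {"csv", "jsonl", "parquet"}

# Codec support as reported in this comment (an assumption, not verified here).
SUPPORT: Dict[str, Dict[str, Set[str]]] = {
    "pyarrow": {
        "default": {"gzip", "bzip2"},
        "needs_integration": {"brotli", "lz4", "snappy", "zstd"},
    },
    "fastavro": {
        "default": {"snappy", "deflate", "zstandard", "bzip2", "lz4", "xz"},
        "needs_integration": set(),
    },
}


def effort(file_format: str, codec: str) -> str:
    """Classify a (format, codec) pair into one of three effort tiers.

    "verify"       -> the backing library handles it by default
    "integrate"    -> the library supports it, connector work is needed
    "new-solution" -> no support reported, something new is required
    """
    fmt = file_format.lower()
    if fmt == "avro":
        library = "fastavro"
    elif fmt in PYARROW_FORMATS:
        library = "pyarrow"
    else:
        raise ValueError(f"unsupported file format: {file_format}")

    tiers = SUPPORT[library]
    name = codec.lower()
    if name in tiers["default"]:
        return "verify"
    if name in tiers["needs_integration"]:
        return "integrate"
    return "new-solution"
```

With this table, `effort("avro", "snappy")` lands in item 1, `effort("parquet", "snappy")` in item 2, and `effort("csv", "zip")` in item 3.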

ryjabe commented 1 year ago

To add more context to my previous note, the user required snappy for Parquet files.

Thank you Denys for the additional scoping here.