dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Save compressed load files with `.gz` extension #925

Open steinitzu opened 10 months ago

steinitzu commented 10 months ago

Problem description

Load files are always saved with the extension of the file format, regardless of whether compression is enabled.
For example, s3://path/to/load/file.jsonl may or may not be compressed.
This causes issues with, e.g., the databricks loader, which can't parse gzip files without a .gz extension, so compression must be disabled there. It possibly affects snowflake and redshift too; I think we assume json files are always compressed.
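To illustrate the ambiguity: since the extension says nothing about compression, the only reliable check today is to look at the file content. A minimal sketch, assuming you can read the first two bytes of the file; `looks_gzipped` is a hypothetical helper, not part of dlt:

```python
import gzip

# gzip streams start with these two bytes (the gzip magic number)
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(head: bytes) -> bool:
    """Return True if the leading bytes match the gzip magic number.

    Hypothetical helper for illustration only, not a dlt function.
    """
    return head[:2] == GZIP_MAGIC


# Two payloads that could both be stored as "file.jsonl"
raw = b'{"id": 1}\n'
compressed = gzip.compress(raw)

print(looks_gzipped(raw))         # False
print(looks_gzipped(compressed))  # True
```

A loader that only sees the file name has no way to do this check, which is why the extension matters.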

Solution

The data writer should add the .gz extension when compression is enabled.
There are a few places where we parse filenames to detect the format, usually relying on filename.endswith('.<file_format>') or os.path.splitext. These need to be refactored: we need one clean way to get the format and compression from a filename.
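A rough sketch of what such a helper pair could look like, assuming the convention is `<name>.<file_format>` with an optional trailing `.gz`. The names `LoadFileInfo`, `with_compression_ext` and `parse_load_file_name` are made up for this example and are not existing dlt APIs:

```python
import os
from typing import NamedTuple

GZ_EXT = ".gz"


class LoadFileInfo(NamedTuple):
    # Hypothetical container for what we can learn from a file name
    file_format: str
    is_compressed: bool


def with_compression_ext(file_name: str, compressed: bool) -> str:
    """Writer side: append .gz when compression is enabled."""
    if compressed and not file_name.endswith(GZ_EXT):
        return file_name + GZ_EXT
    return file_name


def parse_load_file_name(file_name: str) -> LoadFileInfo:
    """Reader side: get format and compression from the name alone."""
    base = os.path.basename(file_name)
    is_compressed = base.endswith(GZ_EXT)
    if is_compressed:
        # Drop the compression suffix before looking for the format
        base = base[: -len(GZ_EXT)]
    _, ext = os.path.splitext(base)
    if not ext:
        raise ValueError(f"no file format extension in {file_name!r}")
    return LoadFileInfo(file_format=ext[1:], is_compressed=is_compressed)


name = with_compression_ext("s3://path/to/load/file.jsonl", compressed=True)
print(name)                        # s3://path/to/load/file.jsonl.gz
print(parse_load_file_name(name))  # LoadFileInfo(file_format='jsonl', is_compressed=True)
```

Keeping both directions in one place would let the existing `endswith` and `os.path.splitext` call sites go through a single function instead of each guessing at the naming convention.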

elviskahoro commented 2 days ago

+1 this!

You can disable compression by setting the env variable: os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = str(True)

rudolfix commented 8 hours ago

@elviskahoro there are many ways to disable compression in code. All of them change the configuration, though, e.g.:
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
dlt.config["data_writer.disable_compression"] = True