Open steinitzu opened 10 months ago
+1 this!
You can disable compression by setting the env variable: os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = str(True)
@elviskahoro there are many ways to disable compression in code. All of them change the configuration, though, e.g.:
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "True"
dlt.config["data_writer.disable_compression"] = True
Problem description
Load files are always saved with the extension of the file format, regardless of whether compression is enabled.
I.e. s3://path/to/load/file.jsonl may or may not be compressed. This causes issues with e.g. the databricks loader, which can't parse gzip files without a .gz extension, so compression must be disabled. Possibly affects snowflake and redshift too; I think we assume json files are always compressed.
Solution
The data writer should add a .gz extension when compression is enabled.
There are a few places where we parse filenames to detect the format, usually relying on filename.endswith('.<file_format>') or os.path.splitext. These need to be refactored; we need a clean way to get format/compression from a filename.
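A rough sketch of what a single helper for this could look like, to make the proposal concrete. This is not existing dlt code: the names parse_load_file_name, build_load_file_name, FileSpec, KNOWN_FORMATS and COMPRESSION_EXTENSIONS are all made up for illustration, and the list of formats is an assumption.

```python
from typing import NamedTuple, Optional

# Hypothetical lookup tables, for illustration only.
KNOWN_FORMATS = ("jsonl", "parquet", "csv")
COMPRESSION_EXTENSIONS = {"gz": "gzip"}


class FileSpec(NamedTuple):
    file_format: str
    compression: Optional[str]  # None means the file is not compressed


def parse_load_file_name(file_name: str) -> FileSpec:
    """Split a load file name into its format and optional compression."""
    # Only look at the last path segment so dots in directories are ignored.
    parts = file_name.rsplit("/", 1)[-1].split(".")
    compression = None
    # A trailing compression extension (e.g. "gz") is stripped first.
    if len(parts) > 1 and parts[-1] in COMPRESSION_EXTENSIONS:
        compression = COMPRESSION_EXTENSIONS[parts.pop()]
    # What remains must end with a known file format.
    if len(parts) < 2 or parts[-1] not in KNOWN_FORMATS:
        raise ValueError(f"Cannot detect file format from {file_name!r}")
    return FileSpec(parts[-1], compression)


def build_load_file_name(stem: str, file_format: str, compressed: bool) -> str:
    """Build a file name, appending .gz only when compression is enabled."""
    suffix = ".gz" if compressed else ""
    return f"{stem}.{file_format}{suffix}"


if __name__ == "__main__":
    print(parse_load_file_name("s3://path/to/load/file.jsonl"))
    print(parse_load_file_name("s3://path/to/load/file.jsonl.gz"))
    print(build_load_file_name("file", "jsonl", compressed=True))
```

With something like this, the scattered endswith / os.path.splitext checks could call one function, and a loader would know from the name alone whether it has to decompress the file.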