dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0
2.38k stars 154 forks source link

file format inconsistencies with variant columns #1807

Open jorritsandbrink opened 2 weeks ago

jorritsandbrink commented 2 weeks ago

dlt version

0.9.9a1

Describe the problem

I observe inconsistent behavior between different file formats in the case of variant columns on the filesystem destination.

When loading {"foo": 1} in run 1 and {"foo": "foo"} in run 2, I see the following in the loaded file for run 2 with varying formats: jsonl:

{"foo__v_text":"foo","_dlt_load_id":"1726325620.5591211","_dlt_id":"5S+YtEla4uOxvQ"}

csv:

"foo","_dlt_load_id","_dlt_id","foo__v_text"
"","1726324877.9022982","p3KiDsSpOwLI8w","foo" 

parquet (displayed as jsonl)

{"foo":null,"_dlt_load_id":"1726325669.691154","_dlt_id":"D4glk8Nx70uj+Q","foo__v_text":"foo"}

jsonl does not contain foo, while csv and parquet do contain foo with a "null" value.

Update:

There is even more inconsistency when setting nullable to False with a column hint.

Expected behavior

Consistency between file formats.

Steps to reproduce

import os
import dlt
from dlt.destinations import filesystem

os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "true"

@dlt.resource(
    file_format="jsonl",  # change for other formats
    # columns={"foo": {"nullable": False}}  # uncomment for not nullable case
}
def inconsistent_data(dtype: str):
    if dtype == "bigint":
        yield {"foo": 1}
    elif dtype == "text":
        yield {"foo": "foo"}

pipe = dlt.pipeline(
    pipeline_name="variant",
    pipelines_dir="_storage",
    destination=filesystem("_storage"),
)

pipe.run(inconsistent_data("bigint"))
pipe.run(inconsistent_data("text"))

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

No response