Open jorritsandbrink opened 2 weeks ago
0.9.9a1
I observe inconsistent behavior between different file formats in the case of variant columns on the filesystem destination.
filesystem
When loading {"foo": 1} in run 1 and {"foo": "foo"} in run 2, I see the following in the loaded file for run 2 with varying formats: jsonl:
{"foo": 1}
{"foo": "foo"}
jsonl
{"foo__v_text":"foo","_dlt_load_id":"1726325620.5591211","_dlt_id":"5S+YtEla4uOxvQ"}
csv:
csv
"foo","_dlt_load_id","_dlt_id","foo__v_text" "","1726324877.9022982","p3KiDsSpOwLI8w","foo"
parquet (displayed as jsonl)
parquet
{"foo":null,"_dlt_load_id":"1726325669.691154","_dlt_id":"D4glk8Nx70uj+Q","foo__v_text":"foo"}
jsonl does not contain foo, while csv and parquet do contain foo with a "null" value.
foo
Update:
There is even more inconsistency when setting nullable to False with a column hint.
nullable
False
""
0
null
Consistency between file formats.
import os import dlt from dlt.destinations import filesystem os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "true" @dlt.resource( file_format="jsonl", # change for other formats # columns={"foo": {"nullable": False}} # uncomment for not nullable case } def inconsistent_data(dtype: str): if dtype == "bigint": yield {"foo": 1} elif dtype == "text": yield {"foo": "foo"} pipe = dlt.pipeline( pipeline_name="variant", pipelines_dir="_storage", destination=filesystem("_storage"), ) pipe.run(inconsistent_data("bigint")) pipe.run(inconsistent_data("text"))
Linux
Local
3.10
No response
Filesystem & buckets
dlt version
0.9.9a1
Describe the problem
I observe inconsistent behavior between different file formats in the case of variant columns on the
filesystem
destination.When loading
{"foo": 1}
in run 1 and{"foo": "foo"}
in run 2, I see the following in the loaded file for run 2 with varying formats:jsonl
:csv
:parquet
(displayed as jsonl)jsonl
does not containfoo
, whilecsv
andparquet
do containfoo
with a "null" value.Update:
There is even more inconsistency when setting
nullable
toFalse
with a column hint.jsonl
:foo
is absent (so non-nullability is not respected) (same as before)csv
:foo
is""
(same as before)parquet
:foo
is0
(different than before, when it wasnull
)Expected behavior
Consistency between file formats.
Steps to reproduce
Operating system
Linux
Runtime environment
Local
Python version
3.10
dlt data source
No response
dlt destination
Filesystem & buckets
Other deployment details
No response
Additional information
No response