dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Filesystem destination does not respect preferred_loader_file_format for dlt metadata #1631

Open Nintorac opened 3 months ago

Nintorac commented 3 months ago

dlt version

dlt==0.5.1

Describe the problem

The filesystem destination does not respect the preferred_loader_file_format kwarg when writing dlt metadata files; only the data tables use the configured format.

Further discussion here

Expected behavior

When configuring preferred_loader_file_format="parquet", I expect the metadata files to be in parquet format; instead, they are jsonl.

Steps to reproduce

  1. Run this code
import dlt
from dlt.destinations import filesystem

# Ask the filesystem destination to write parquet files
parquet_file_system = filesystem(
    preferred_loader_file_format="parquet"
)

pipeline = dlt.pipeline(
    pipeline_name='pipeline',
    destination=parquet_file_system,
    dataset_name='dataset',
)

data = [ {'b': 2} ]
# Load a single row into the 'repro' table
pipeline_run = pipeline.run(
    data,
    table_name='repro',
)
  2. Observe metadata files are jsonl, rather than the expected parquet
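To make the observation in the last step easier to check, here is a minimal sketch that tallies file extensions per table folder. It assumes the bucket is a local directory and that each table gets its own folder under the dataset directory, with metadata tables in folders prefixed `_dlt`; the `tally_extensions` helper and the `_storage/dataset` path are hypothetical and not part of dlt.

```python
from collections import Counter
from pathlib import Path


def tally_extensions(dataset_dir: str) -> dict[str, Counter]:
    """Count file suffixes per top-level folder under dataset_dir."""
    root = Path(dataset_dir)
    counts: dict[str, Counter] = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Files sitting directly in the dataset folder are grouped under "."
        table = rel.parts[0] if len(rel.parts) > 1 else "."
        counts.setdefault(table, Counter())[path.suffix or "<none>"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical local path; adjust to your own bucket location and dataset_name.
    for table, suffixes in sorted(tally_extensions("_storage/dataset").items()):
        print(table, dict(suffixes))
```

If the report holds, the `repro` folder should show parquet files while the `_dlt`-prefixed folders show jsonl.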

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

No response

sh-rp commented 3 months ago

Hey @Nintorac, this is an implementation decision and not a bug. I agree though that we should probably add a note about it in the docs. Is the fact that the metadata tables are stored as jsonl posing a problem for you at this time?

Nintorac commented 3 months ago

Mainly my aversion to jsonl for now aha, but some issues I foresee

Would be interested to know why the metadata table write mechanism doesn't use the same pathway as data table write though? From my limited perspective it seems like this functionality should be implemented at the abstract destination level

sh-rp commented 3 months ago

@Nintorac ok I understand. So you are actually reading the metadata files in your code? I was more or less working under the assumption that they are for internal dlt use only. But it is a fair point.

Nintorac commented 3 months ago

I was intending to use it for change data capture for scd2 type tables (since this isn't supported natively)

But I wasn't aware they were meant for internal use only.
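For anyone who wants to read the jsonl metadata files in their own code in the meantime, a minimal standard-library sketch follows. It assumes each line of the file is one JSON object and that the file may or may not be gzip-compressed depending on configuration, so it checks the gzip magic bytes first. The `read_jsonl` helper is hypothetical and not part of dlt.

```python
import gzip
import json
from typing import Iterator

GZIP_MAGIC = b"\x1f\x8b"


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a (possibly gzipped) JSON Lines file."""
    # Peek at the first two bytes to detect gzip compression.
    with open(path, "rb") as probe:
        compressed = probe.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

The rows come back as plain dicts, so they can be filtered or compared between loads with ordinary Python before any change data capture logic is applied.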

sh-rp commented 3 months ago

I'd say they are not strictly meant for internal use, I just didn't expect anyone to want to query them in the way you describe. scd2 tables are currently not supported for the filesystem, by the way (although with the delta tables it should actually work). Could you explain in a bit more detail what you want to do? I'd like to understand the use case and maybe offer some help or take some inspiration for further work on the filesystem.