dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

add "Normalize" as a boolean argument to pipeline.run #1556

Closed dbrtly closed 1 month ago

dbrtly commented 1 month ago

Feature description

BigQuery has a "record" type with repeated mode for storing denormalized (nested) data in a single table. By default, dlt currently normalizes this nested data into separate child tables. There are times when retaining the exact structure generated at the source is preferable, because it removes the need to join the tables back together when querying them. This is particularly relevant for data that contains many such records.

Are you a dlt user?

I'd consider using dlt, but it's lacking a feature I need.

Use case

ingestion with guardrails and consistent structure

Proposed solution

I'd like to be able to ingest one table chess_pipeline.games_data with a call like:

```python
dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="bigquery",
    dataset_name="games_data",
    normalize=False,
)
```

with a schema like:

```yaml
name: chess
tables:
  players_games:
    columns:
      white:
        data_type: struct
        repeated: true
        columns:
          rating:
            data_type: bigint
            nullable: true
          result:
            data_type: text
            nullable: true
          aid:
            data_type: text
            nullable: true
          username:
            data_type: text
            nullable: true
          uuid:
            data_type: text
            nullable: true
```

Related issues

_No response_
rudolfix commented 1 month ago

@dbrtly you can set the nesting level for any resource both when it is declared and after that via a property: https://dlthub.com/devel/general-usage/resource#reduce-the-nesting-level-of-generated-tables

Nested fields will be stored as JSON in BigQuery; I hope that will be enough for you. Currently we do not support struct fields in our standard BigQuery destination.
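For reference, a minimal sketch of that option, based on the `max_table_nesting` setting documented at the link above (the resource name and sample data here are made up for illustration):

```python
import dlt

# Limit how deep dlt unpacks nested data. With max_table_nesting=0, nested
# objects and lists stay in the parent row and are loaded as JSON columns
# instead of being normalized into child tables.
@dlt.resource(max_table_nesting=0)
def players_games():
    yield {
        "white": {"rating": 1850, "result": "win", "username": "player_a"},
        "black": {"rating": 1700, "result": "loss", "username": "player_b"},
    }

# Per the linked docs, the same setting is also available as a property on the
# resource instance, so it can be changed after the resource is declared.

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="bigquery", dataset_name="games_data"
)
pipeline.run(players_games())
```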

There's a custom BigQuery destination that will do that (it uses BigQuery auto schema inference): https://dlthub.com/devel/examples/custom_destination_bigquery

As you can see, it is practically a one-liner :) but it does not support merges. We are looking for a clever way to implement structured fields for some destinations, but it looks quite complicated and hard to generalize.
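A rough sketch of that approach, assuming the `@dlt.destination` decorator used in the linked example (the destination name, dataset name, and batch size below are placeholders, not taken from the example itself):

```python
import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema
from google.cloud import bigquery

@dlt.destination(name="bigquery_autodetect", batch_size=500)
def bigquery_insert(items: TDataItems, table: TTableSchema) -> None:
    # Hand each batch straight to BigQuery and let it infer the schema itself,
    # including nested RECORD / REPEATED fields, instead of using dlt's normalizer.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition="WRITE_APPEND",
    )
    job = client.load_table_from_json(
        items, f"raw_dataset.{table['name']}", job_config=job_config
    )
    job.result()  # wait for the load job and raise on failure
```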

dbrtly commented 1 month ago

In the solution I imagine, the nested tables are BigQuery "record" types with native types inside them (I provided an example above, though the whitespace got mangled when I posted it).

Falling back to JSON seems like a missed opportunity.

https://raw.githubusercontent.com/CVEProject/cvelist/master/2021/44xxx/CVE-2021-44228.json

As an example, loading the above file directly into BigQuery using the bq CLI tool will create one table with several structs.

For context, BigQuery's auto-detect can infer the schema from the first 500 lines of a JSONL file, including repeated records and native types within the structs. The bigquery-schema-generator library on PyPI removes the 500-line limit.
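For illustration, a minimal sketch of such an autodetect load using the Python client instead of the bq CLI (the project, dataset, table, and file names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # schema inferred from a sample of the input rows
)
with open("cves.jsonl", "rb") as f:
    job = client.load_table_from_file(f, "my_project.raw.cves", job_config=job_config)
job.result()

# The inferred schema contains nested RECORD and REPEATED fields rather than JSON blobs.
print(client.get_table("my_project.raw.cves").schema)
```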

In contrast, dlt creates 10 data tables:

| table_schema | table_name |
| --- | --- |
| nesting | cves |
| nesting | cves__affects__vendor__vendor_data |
| nesting | cves__affects__vendor__vendor_data__product__product_data |
| nesting | cves__affects__vendor__vendor_data__product__product_data__version__version_data |
| nesting | cves__credit |
| nesting | cves__description__description_data |
| nesting | cves__impact |
| nesting | cves__problemtype__problemtype_data |
| nesting | cves__problemtype__problemtype_data__description |
| nesting | cves__references__reference_data |

I agree that sometimes normalizing the data makes it easier to consume.

For raw data outside that set, I would find it easiest to consume downstream if the ingestion process did not normalize it into separate tables but still retained the native types (rather than just dumping JSON).

rudolfix commented 1 month ago

Yes, this is a bit of a missed opportunity and makes working with nested data harder. Requests to support nested types come up from time to time from our users, and at some point we'll add generic (destination-independent) support. Tbh, it looks like a several-week project to me; before writing the "normalizer" you mention, we'll need to add nested types at all, i.e. allow loading parquet files containing them: https://github.com/dlt-hub/dlt/issues/1592

Regarding BigQuery: in this case things are simple. It can infer a schema from data and will also follow parquet schemas, all in a way very similar to dlt, so letting people choose which tables dlt should manage and which BigQuery should manage allows adding nested types pretty seamlessly: https://github.com/dlt-hub/dlt/pull/1591. That was unexpectedly easy and we'll probably merge it into core.

dbrtly commented 1 month ago

Progress. 😀

dbrtly commented 1 month ago

Thinking about how this could go wrong: BigQuery's auto-detect can infer the schema from only the first 500 lines of a JSONL file, including repeated records and native types within the structs.

This means that if, for example, a value changes from int to string in the 501st row, the autodetect load will not handle the string correctly and will fail. The bigquery-schema-generator library on PyPI removes the 500-line limit.

We should probably have tests for BigQuery that attempt to load 1000 rows where the first 500 use simple types but later rows differ: for example, an int that changes to a str, a datetime that changes to a non-ISO 8601 string, and a boolean that changes to a string.
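A sketch of what generating such a fixture could look like (the file and field names are made up):

```python
import json

# 1000 JSONL rows: the first 500 use one set of types, the rest switch types,
# so a schema inferred only from the first 500 rows is violated later on.
with open("type_drift.jsonl", "w") as f:
    for i in range(1000):
        if i < 500:
            row = {"count": i, "created_at": "2024-07-01T12:00:00Z", "active": True}
        else:
            row = {"count": str(i), "created_at": "01/07/2024", "active": "yes"}
        f.write(json.dumps(row) + "\n")
```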

rudolfix commented 1 month ago

I consider the BigQuery PR above a useful hack for power users... When we do #1592 (the full implementation), we'll use the whole available dataset to infer the schema (this is what dlt does now). If breaking changes come later, we generate variant columns that hold the types that cannot be coerced: https://dlthub.com/docs/general-usage/schema#variant-columns
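For illustration, a small sketch of the variant-column behavior described in the linked docs (the pipeline, dataset, and table names are made up, and duckdb is used only to keep it self-contained):

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="variant_demo", destination="duckdb", dataset_name="schema_demo"
)

# The first load fixes "answer" as an integer column.
pipeline.run([{"answer": 1}, {"answer": 2}], table_name="events")

# The string in the second load cannot be coerced to an integer, so per the
# linked docs dlt adds a variant column for it instead of failing the load.
pipeline.run([{"answer": "not a number"}], table_name="events")

# Inspect the resulting schema to see the extra variant column.
print(pipeline.default_schema.to_pretty_yaml())
```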

Btw, I looked at bigquery-schema-generator; it is cool and maybe we can copy some ideas from there :)

rudolfix commented 1 month ago

I'm closing this in favor of #1592, where the relational normalizer will be replaced with a flat one that supports nested types.