dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Data type not detected if first batch of records is null #1666

Closed rubenhelsloot closed 2 months ago

rubenhelsloot commented 3 months ago

dlt version

0.5.2

Describe the problem

I have a SQL database in SAP HANA that I want to load into Postgres using dlt. However, I am running into an issue with a few of the tables that contain a lot of NULLs. The columns are of type TIMESTAMP, BOOLEAN, or INT64, but there are so many NULLs that the first chunk of extracted data contains only NULLs.

Therefore, the data type is incorrectly inferred as STRING, which then leads to an error when loading the Parquet files:

ValueError: Table schema does not match schema used to create file: 
table:
objectid: double
...
my_field: double vs. 
file:
objectid: double
...
my_field: string

Some notes

For reference, this is how the pandas backend wraps each fetched chunk:

# columns whose dlt data type is date/time-like are handed to pandas as parse_dates
date_cols = [
    col for col in self.columns
    if self.columns[col].get("data_type") in ("timestamp", "date", "time", "timestamptz")
]

# only parse_dates is passed; without a dtype argument the remaining column
# types are inferred from the values in the chunk
df = _wrap_result(
    partition,
    columns,
    parse_dates=date_cols or None,
    **{"dtype_backend": "pyarrow", **backend_kwargs},
)
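
For illustration, a minimal sketch (plain pandas, not dlt code; the column name is made up) of why inference cannot help when a chunk is all NULLs:

import pandas as pd

# first batch: only NULLs, so pandas has nothing to infer a type from
chunk = [(None,), (None,), (None,)]
df = pd.DataFrame(chunk, columns=["my_field"])
print(df.dtypes)  # object - in the pipeline this ends up as string (see the error above), not the declared TIMESTAMP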

Expected behavior

The dataframe chunks should take the data type from the column definitions - at least when the values are all NULLs.

Steps to reproduce

Create a database with a table that has only NULLs in the first rows. Give it a data type that is not string. Then set the chunk size to be smaller than the number of leading NULLs.
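
A rough reproduction sketch, assuming the verified sql_database source with its backend and chunk_size arguments; the credentials, table name and sizes below are placeholders:

import dlt
from sql_database import sql_table  # verified source: dlt-hub/verified-sources

# events.created_at is declared TIMESTAMP, but its first ~10k rows are NULL
source = sql_table(
    credentials="hana://user:password@host:39015/HXE",  # placeholder DSN
    table="events",
    backend="pandas",  # the backend that infers types from the data
    chunk_size=1000,   # smaller than the run of leading NULLs
)

pipeline = dlt.pipeline(pipeline_name="repro", destination="postgres", dataset_name="repro")
pipeline.run(source)   # fails with the Parquet schema mismatch shown above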

Operating system

macOS

Runtime environment

Airflow

Python version

3.11

dlt data source

sql_database

dlt destination

Postgres

Other deployment details

No response

Additional information

No response

AndrewBasem1 commented 3 months ago

I've checked and was able to replicate this issue. I'd be happy to fix it as my first contribution, but would need some guidance if that's possible.

rudolfix commented 3 months ago

@rubenhelsloot is there anything that prevents you from using the arrow backend? In that case we create the arrow schema ourselves and the data types are properly preserved. If you are not interacting with pandas frames in your pipeline, there's really no reason to use pandas: the frames are serialized to Parquet before they are loaded, so you need arrow anyway.

It would be cool to know why you need pandas :)
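
For reference, a minimal sketch of that switch, assuming the same verified sql_database source; only the backend argument changes, the other values are placeholders:

from sql_database import sql_table

source = sql_table(
    credentials="hana://user:password@host:39015/HXE",  # placeholder
    table="events",
    backend="pyarrow",  # the arrow schema is built from the reflected column types,
                        # so all-NULL chunks keep their declared types
    chunk_size=1000,
)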

@AndrewBasem1 We could implement the same thing for pandas. First you'd need to map dlt types to pandas types, and then type the dataframe by passing the dtype: DtypeArg | None = None argument to _wrap_result. In the end the code will be similar to this:

https://github.com/dlt-hub/verified-sources/blob/master/sources/sql_database/arrow_helpers.py

The code above handles a lot of edge cases, i.e. using pyarrow to infer unknown data types from the data while enforcing the known types. A radical solution would be to always use arrow and just convert to pandas at the very end. We, however, added pandas to use its type inference in the first place...
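
A sketch of that pandas-side idea; the mapping and helper below are hypothetical illustrations, not existing dlt code:

import pandas as pd
import pyarrow as pa

# assumed, partial mapping from dlt data types to pandas/pyarrow dtypes
DLT_TO_PANDAS_DTYPE = {
    "bigint": "int64[pyarrow]",
    "double": "float64[pyarrow]",
    "bool": "bool[pyarrow]",
    "text": "string[pyarrow]",
    "timestamp": pd.ArrowDtype(pa.timestamp("us", tz="UTC")),
    "date": pd.ArrowDtype(pa.date32()),
}

def dtypes_from_schema(columns: dict) -> dict:
    # build the dtype argument for _wrap_result from the dlt column schema
    return {
        name: DLT_TO_PANDAS_DTYPE[col["data_type"]]
        for name, col in columns.items()
        if col.get("data_type") in DLT_TO_PANDAS_DTYPE
    }

# then, in the pandas backend, pass it along with parse_dates:
# df = _wrap_result(partition, columns, parse_dates=date_cols or None,
#                   dtype=dtypes_from_schema(self.columns),
#                   **{"dtype_backend": "pyarrow", **backend_kwargs})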

rubenhelsloot commented 2 months ago

Hi @rudolfix,

I was able to fix the issue by switching to the arrow backend, thanks! If I recall correctly, I avoided doing that earlier because I got a different error in the beginning (probably _dlt_load_id and _dlt_id missing) and wanted to avoid having to do a full refresh of the existing pipeline.

rudolfix commented 2 months ago

@rubenhelsloot we'll consider implementing this for pandas anyway. We are investigating #1507 to expose dataframes from any destination, and that will be needed there.