Closed: rubenhelsloot closed this issue 2 months ago
I've checked and was able to replicate this issue. I'd be happy to fix it as my first contribution, but I would need some guidance if that's possible.
@rubenhelsloot is there anything that prevents you from using the arrow backend? In that case we create the arrow schema ourselves and the data gets properly preserved. If you are not interacting with pandas frames in your pipeline, there's no reason to use pandas really; pandas frames are serialized as parquet before they are loaded, so you need arrow anyway (see the sketch below).
It would be cool to know why you need pandas :)
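For reference, switching the backend should be a one-argument change on the source. A minimal sketch, assuming dlt's `sql_database` verified source; the connection string and table name are placeholders:

```python
import dlt
from sql_database import sql_database  # dlt verified source

# "pyarrow" builds the arrow schema from the column definitions,
# so all-NULL chunks keep their declared types.
source = sql_database(
    "hana+hdbcli://user:password@host:39015",  # hypothetical SAP HANA DSN
    table_names=["my_table"],                  # placeholder table name
    backend="pyarrow",
)

pipeline = dlt.pipeline(pipeline_name="hana_to_pg", destination="postgres")
pipeline.run(source)
```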
@AndrewBasem1
We could implement the same thing for pandas. First you'd need to map dlt types to pandas types, and then type the dataframe by passing a `dtype: DtypeArg | None = None` argument when wrapping the results. In the end the code will be similar to this:
https://github.com/dlt-hub/verified-sources/blob/master/sources/sql_database/arrow_helpers.py
The code above handles a lot of edge cases, i.e. using pyarrow to infer unknown data types from the data while enforcing known types. A radical solution would be to always use arrow and just convert it to pandas at the very end; we added pandas to use their type inference in the first place, however...
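A minimal sketch of what that mapping could look like. The dlt type names follow dlt's data types, but the pandas dtype choices and the helper name are illustrative assumptions, not existing library code:

```python
# Hypothetical sketch: map dlt column types to pandas dtypes so that
# all-NULL chunks keep the schema-declared type instead of being
# inferred as object/string. Names below are illustrative.
from typing import Any, Dict

# dlt data types -> pandas nullable dtypes; an assumption, not dlt's API
DLT_TO_PANDAS_DTYPE: Dict[str, str] = {
    "bigint": "Int64",                   # nullable integer
    "bool": "boolean",                   # nullable boolean
    "double": "Float64",
    "timestamp": "datetime64[ns, UTC]",
    "date": "datetime64[ns]",
    "text": "string",
}

def dtypes_for_columns(columns: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Build the `dtype` mapping to pass to pandas from a dlt table schema.

    `columns` is assumed to look like dlt's column schema:
    {"col_name": {"data_type": "timestamp", ...}, ...}
    """
    return {
        name: DLT_TO_PANDAS_DTYPE[col["data_type"]]
        for name, col in columns.items()
        if col.get("data_type") in DLT_TO_PANDAS_DTYPE
    }
```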
Hi @rudolfix,
I was able to fix the issue by switching to arrow, thanks!
If I recall correctly, I avoided doing that because I got a different error in the beginning (probably it was `_dlt_load_id` and `_dlt_id` missing), and I wanted to avoid having to do a full refresh on the existing pipeline.
@rubenhelsloot we'll consider implementing pandas anyway. We are investigating #1507 to expose dataframes from any destination, and that will be needed.
dlt version
0.5.2
Describe the problem
I have a SQL database in SAP HANA that I want to transport to Postgres using dlt. However, I am running into an issue for a few of the tables that have a lot of NULLs. The columns are of type `TIMESTAMP`, `BOOLEAN`, or `INT64`, but there are so many NULLs that the first chunk of loaded SQL contains only NULLs. Therefore, the data type is incorrectly inferred as STRING, which then leads to an error loading the Parquet files.
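The inference problem can be reproduced in isolation with plain pandas (a minimal illustration):

```python
import pandas as pd

# A chunk containing only NULLs: pandas infers dtype "object",
# which then gets written to parquet as a string column.
chunk = pd.DataFrame({"created_at": [None, None, None]})
print(chunk.dtypes)  # created_at    object
```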
Some notes
- I am using the `pandas` backend of the `sql_database` source.
- The dataframe chunks are built in `TableLoader._load_rows`.
Expected behavior
The dataframe chunks should take their data types from the column definitions, at least when the values are all NULLs. A sketch of a possible fix is below.
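For example, assuming access to the reflected SQLAlchemy `Table`, the dtypes could be derived from the column definitions instead of the data. A sketch with an illustrative, deliberately incomplete type mapping:

```python
import sqlalchemy as sa

def dtypes_from_table(table: sa.Table) -> dict:
    """Derive pandas dtypes from SQLAlchemy column definitions so that
    all-NULL chunks are not inferred as object/string. Illustrative only."""
    dtypes = {}
    for col in table.columns:
        if isinstance(col.type, sa.types.DateTime):
            dtypes[col.name] = "datetime64[ns]"
        elif isinstance(col.type, sa.types.Boolean):
            dtypes[col.name] = "boolean"   # pandas nullable boolean
        elif isinstance(col.type, sa.types.Integer):
            dtypes[col.name] = "Int64"     # pandas nullable integer
    return dtypes

# The chunk could then be typed explicitly instead of relying on inference:
# df = pd.DataFrame(rows, columns=names).astype(dtypes_from_table(table))
```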
Steps to reproduce
Create a database with a table whose first rows contain only NULLs. Give the column a data type that is not string. Then set the chunk size to be smaller than the number of leading NULLs, as in the sketch below.
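A minimal sketch of such a reproduction, assuming dlt's `sql_table` verified source; the connection string, table, and column names are placeholders:

```python
import dlt
from sql_database import sql_table  # dlt verified source

# Assume `events.created_at` is a TIMESTAMP column whose first 2000 rows
# are NULL. With chunk_size < 2000, the first pandas chunk is all-NULL,
# so its dtype is inferred as object/string and the parquet load fails.
table = sql_table(
    "hana+hdbcli://user:password@host:39015",  # hypothetical connection string
    table="events",
    backend="pandas",
    chunk_size=1000,
)

pipeline = dlt.pipeline(pipeline_name="repro", destination="postgres")
pipeline.run(table)
```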
Operating system
macOS
Runtime environment
Airflow
Python version
3.11
dlt data source
sql_database
dlt destination
Postgres
Other deployment details
No response
Additional information
No response