Closed: ThomasMargnac closed this issue 9 months ago
This is because Pandas has a lovely index.... So, the main issue is that on the first write the index column stayed in the data while writing to parquet. What PyArrow version are you using?
I can see pa.Table.from_pandas has a preserve_index parameter.
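For illustration, here is a minimal sketch (not taken from the original thread) of how that index shows up in the Arrow schema; the extra column only appears when the DataFrame's index is not a plain RangeIndex, and the sample data below is invented:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"text": ["a", "b", "c"]})
df = df[df["text"] != "b"]  # filtering leaves a non-contiguous integer index

# Default conversion serializes that index as an extra column
print(pa.Table.from_pandas(df).schema.names)
# ['text', '__index_level_0__']

# Dropping the index keeps only the declared columns
print(pa.Table.from_pandas(df, preserve_index=False).schema.names)
# ['text']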
I am using PyArrow 12.0.0
Same issue here. @ion-elgreco's suggestion fixed it:
import pyarrow as pa
from deltalake import write_deltalake

# Drop the pandas index so it is not written into the table schema
data = pa.Table.from_pandas(data, preserve_index=False)
write_deltalake(
    table_or_uri=s3_endpoint,
    data=data,
    mode='append',
    storage_options=storage_options
)
Thanks a lot @ion-elgreco and @titowoche30, it worked! Afterwards I had an issue on the first write related to the data schema, but I fixed my data schema in pa.Table.from_pandas and everything works fine.
Environment
Delta-rs version: 0.10.2
Binding: Python 3.9.17
Environment: local
Bug
What happened:
I am trying to pull new data (which contains text) from a delta table in my bucket A, apply some transformations to it (removing URLs, removing hashtags, …), and finally load the transformed data into a delta table in my bucket B. The first time I ran this pipeline, it worked perfectly fine. Then I inserted new data into my delta table (bucket A). The second time, it failed and displayed the following error:
Apparently, a column named "index_level_0" is required, but it is not a column I defined.
What you expected to happen:
I expected my transformed data to be stored in my delta table (bucket B) without a problem.
How to reproduce it:
Here is my Python script to reproduce it:
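The original script is not preserved above; the following is only a rough sketch of the pipeline described in "What happened", with the bucket URIs, column name, and cleaning steps assumed for illustration. The key point is that pa.Table.from_pandas is called without preserve_index=False, so the leftover pandas index can end up as an extra column on the first append:

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

storage_options = {"AWS_REGION": "eu-west-1"}  # placeholder, not the real config

# Pull the new text data from the delta table in bucket A
df = DeltaTable("s3://bucket-a/raw_text", storage_options=storage_options).to_pandas()

# Transformations: drop empty rows, strip urls and hashtags
df = df.dropna(subset=["text"])  # dropping rows leaves a non-RangeIndex index
df["text"] = df["text"].str.replace(r"https?://\S+", "", regex=True)
df["text"] = df["text"].str.replace(r"#\w+", "", regex=True)

# Without preserve_index=False the leftover index is serialized as
# '__index_level_0__' on the first append; a later append whose data
# does not carry that column then fails the schema check.
data = pa.Table.from_pandas(df)

write_deltalake(
    table_or_uri="s3://bucket-b/clean_text",
    data=data,
    mode="append",
    storage_options=storage_options,
)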
More details: