Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
"Table schema does not match schema used to create file" when incrementally writing parquet with batches from fetcharrowbatches if using strings_as_dictionary=True #185
This worked with the previous versions: Arrow 0.9, Turbodbc 2.7.
Current versions: Arrow 0.11, Turbodbc 3.0.
Workaround: I used strings_as_dictionary=False instead.
Code that triggers the error:
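    import pyarrow.parquet as pq

    schema = None
    for batch in cursor.fetcharrowbatches(strings_as_dictionary=True):
        if schema is None:
            # Create the writer once, from the first batch's schema
            schema = batch.schema
            writer = pq.ParquetWriter(local_output_path, schema, compression='gzip')
        writer.write_table(batch)  # raises the schema mismatch error shown below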
I checked the schema comparison in the output and the only difference I can see is that the dictionaries have different contents. The columns, types and dictionary index sizes are all the same. For example:
    Table schema does not match schema used to create file:
    table:
    ...
    value: dictionary<values=string, indices=int16, ordered=0>
    dictionary:
    [
    "3121.0",
    "3136.0",
    ...
    "3170.0"
    ]
    vs
    file:
    value: dictionary<values=string, indices=int16, ordered=0>
    dictionary:
    [
    "3183.0",
    "3125.0",
    ...
    "3199.0"
    ]