blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.
http://turbodbc.readthedocs.io/en/latest
MIT License

"Table schema does not match schema used to create file" when incrementally writing parquet with batches from fetcharrowbatches if using strings_as_dictionary=True #185

Open chriscomeau79 opened 5 years ago

chriscomeau79 commented 5 years ago

This worked on the previous versions: Arrow 0.9, Turbodbc 2.7.
It fails on the current versions: Arrow 0.11, Turbodbc 3.0.

Workaround: I used strings_as_dictionary=False instead.

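import pyarrow.parquet as pq

schema = None
for batch in cursor.fetcharrowbatches(strings_as_dictionary=True):
    if schema is None:
        # Create the writer once, from the schema of the first batch
        schema = batch.schema
        writer = pq.ParquetWriter(local_output_path, schema, compression='gzip')
    writer.write_table(batch)
if schema is not None:
    writer.close()  # finalize the Parquet file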

I checked the schema comparison in the output and the only difference I can see is that the dictionaries have different contents. The columns, types and dictionary index sizes are all the same. For example:

Table schema does not match schema used to create file: 
table:
...
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3121.0",
      "3136.0",
      ...
      "3170.0"
    ]
vs
file:
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3183.0",
      "3125.0",
      ...
      "3199.0"
    ]

MathMagique commented 5 years ago

@xhochy Does this look familiar?

xhochy commented 5 years ago

Yes, this looks familiar. We have not yet implemented functionality to merge dictionary-encoded data into a unionized type: https://issues.apache.org/jira/browse/ARROW-554
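
To illustrate what merging dictionary-encoded data means here, the following is a minimal pure-Python sketch. It does not use Arrow's API or reflect how Arrow implements anything; the helper names `dictionary_encode` and `unify` are hypothetical, and the batch contents are made up from values shown in the error output above. Each batch gets its own dictionary, so two batches of the same column type can carry different dictionaries. Unifying means building one shared dictionary and re-indexing every batch against it.

```python
def dictionary_encode(values):
    """Encode values as (dictionary, indices), in order of first appearance."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def unify(batches):
    """Re-index every (dictionary, indices) batch against one shared dictionary."""
    unified = []
    positions = {}
    remapped = []
    for dictionary, indices in batches:
        # Map each slot of the batch-local dictionary to its shared slot.
        local_to_shared = []
        for value in dictionary:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            local_to_shared.append(positions[value])
        remapped.append([local_to_shared[i] for i in indices])
    return unified, remapped


# Two batches of the same string column, each encoded independently.
batch_1 = dictionary_encode(["3121.0", "3136.0", "3121.0"])
batch_2 = dictionary_encode(["3183.0", "3125.0", "3121.0"])

# Same logical type, but the per-batch dictionaries differ.
print(batch_1[0] == batch_2[0])

# After unification, both batches index into one shared dictionary.
unified, remapped = unify([batch_1, batch_2])
print(unified, remapped)
```

Until something along these lines is available on the Arrow side, fetching with strings_as_dictionary=False, as in the workaround above, avoids the per-batch dictionaries altogether.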