delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Confusing "Cast Error: cannot cast string to Int64" error, even when field is a string #2557

Open liamphmurphy opened 1 month ago

liamphmurphy commented 1 month ago

Environment

Delta-rs version: v0.16.4

Binding: python

Environment:


Bug

What happened:

We had a delta table with a schema that looked something like this (some names omitted due to data privacy, also excuse the poor indenting):

{
    "id": "string",
    "source": "string",
    "properties": {
        "location": {"type": "string"},
        "results": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "resultId": {"type": "string"},
                    "relevancyScore": {"type": "number"} ## NEW FIELD
                }
            }
        },
        "resultsReturnedQuantity": {"type": "integer"},
        "webContext": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "userId": {"type": "string"},
                "page": {
                    "type": "object",
                    "properties": {
                        "pageType": {"type": "string"},
                        "postId": {"type": "string"},
                        "referrer": {"type": "string"},
                        "revisionId": {"type": "string"},
                        "textFragment": {"type": "string"},
                        "title": {"type": "string"},
                        "url": {"type": "string"}
                    }
                },
                "sessionId": {"type": "string"},
                "timezoneOffset": {"type": "integer"},
                "userAgent": {"type": "string"},
                "filter": {"type": "array", "items": {"type": "string"}} ## NEW FIELD
            }
        }
    }
}

Note that I marked two fields as new. These fields were added to the incoming data, and we attempted to reconcile them with the table schema via a merge. The error discussed below occurred when calling write_deltalake with the rust engine and schema_mode=merge.

Due to concerns around memory usage, we use the pyarrow engine on writes (we're still looking to switch over to the rust engine entirely). However, if we get an error saying the schema of the data does not match what's in the table, we fall back to a rust write with schema_mode=merge. So maybe it's relevant that we use pyarrow first and then the rust engine? That said, we've merged several times this way without issues in the past.
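For anyone trying to reproduce this, the fallback described above can be sketched roughly as follows. This is a minimal sketch and not our production code. The helper name `write_with_merge_fallback`, the `mode="append"` argument, and the use of `ValueError` as the schema-mismatch signal are assumptions made for illustration. The writer is passed in as `write_fn` so the flow can be exercised without a real table.

```python
from typing import Any, Callable, Tuple, Type


def write_with_merge_fallback(
    write_fn: Callable[..., Any],
    table_uri: str,
    data: Any,
    # Assumption: which exception signals a schema mismatch may depend on
    # the deltalake version, so it is configurable here.
    mismatch_errors: Tuple[Type[BaseException], ...] = (ValueError,),
) -> str:
    """Try a pyarrow-engine write first; on a schema mismatch, retry with
    the rust engine and schema_mode="merge".

    Returns the name of the engine that completed the write.
    """
    try:
        # mode="append" is an assumption; the report does not state the mode.
        write_fn(table_uri, data, mode="append", engine="pyarrow")
        return "pyarrow"
    except mismatch_errors:
        # Fallback path: this is the call that raised the cast error.
        write_fn(
            table_uri,
            data,
            mode="append",
            engine="rust",
            schema_mode="merge",
        )
        return "rust"
```

In real use, `write_fn` would be `deltalake.write_deltalake` and `data` a pyarrow table or similar. The cast error from this report would surface from the second call.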

Two fields were added, as mentioned above. After this update, the following error occurred: Cast error: Cannot cast string 'resultId value' to value of Int64 type. The confusing part is that the error was raised on the resultId field, which already existed.

What you expected to happen: I wouldn't have expected an error at all. If one did occur, I would have expected it on one of the two newly added fields, not on an existing field whose type didn't change.

How to reproduce it:

TBD, going to try and reproduce this locally.

EDIT: no luck yet 🤷

More details:

I was able to solve this by opening a PySpark session and running an ALTER TABLE to add the columns.
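A rough sketch of what such a workaround could look like is below. The table name `events`, the column paths, and the SQL types are hypothetical placeholders, not the values actually used. The `element` path segment for reaching into an array of structs is my understanding of Delta's nested-column syntax and should be checked against the Delta Lake documentation.

```python
from typing import Iterable, Tuple


def build_add_columns_sql(
    table_name: str, columns: Iterable[Tuple[str, str]]
) -> str:
    """Build an ALTER TABLE ... ADD COLUMNS statement from (path, type) pairs."""
    column_defs = ", ".join(f"{path} {sql_type}" for path, sql_type in columns)
    return f"ALTER TABLE {table_name} ADD COLUMNS ({column_defs})"


statement = build_add_columns_sql(
    "events",  # hypothetical table name
    [
        # Hypothetical paths and types for the two new fields.
        ("results.element.relevancyScore", "DOUBLE"),
        ("webContext.filter", "ARRAY<STRING>"),
    ],
)
# In a PySpark session configured for Delta Lake: spark.sql(statement)
```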

liamphmurphy commented 3 weeks ago

^ edited to add some new details. I've attempted to reproduce this locally (a pyarrow write first, then the schema changes, then a rust merge write), but so far it has worked as expected.