apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.67k stars 1.06k forks source link

`FieldNotFound` but field is shown in `valid_fields` #6699

Open MidasLamb opened 1 year ago

MidasLamb commented 1 year ago

Describe the bug

I'm doing a query with aliases, and when trying to execute it, it gives back a SchemaError saying FieldNotFound, however that error also contains a list of valid fields, and the field it's saying isn't found, is listed in there. The error (formatted):

SchemaError(FieldNotFound {
field:
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "776ff32c7d7bfb1c1a49d5c0d94e0db8" },
valid_fields: [
    Column { relation: Some(Bare { table: "75620d04a8b7b34c70c68f7af68953af" }), name: "qid" },
    Column { relation: Some(Bare { table: "75620d04a8b7b34c70c68f7af68953af" }), name: "timestamp" },
    Column { relation: Some(Bare { table: "75620d04a8b7b34c70c68f7af68953af" }), name: "createdDate" },
    Column { relation: Some(Bare { table: "75620d04a8b7b34c70c68f7af68953af" }), name: "82ad7656c6b6be56e9abfac772c89b4c" },
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "dataScope" },
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "qid" },
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "timestamp" },
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "createdDate" },
    Column { relation: Some(Bare { table: "linkedtwintag" }), name: "776ff32c7d7bfb1c1a49d5c0d94e0db8" }
] })

As you can see, the field that is not found is the last one in the valid_fields list.

To Reproduce

Create two mysql tables, and use the following crate to pull them in: https://github.com/splitgraph/seafowl/tree/main/datafusion_remote_tables.

Execute the following (or similar query):

SELECT "75620d04a8b7b34c70c68f7af68953af".*
FROM "75620d04a8b7b34c70c68f7af68953af" 
JOIN "f4be58656cd5e54b9bffa1706c3c49ff" AS "linkedTwintag" ON "75620d04a8b7b34c70c68f7af68953af"."82ad7656c6b6be56e9abfac772c89b4c" = "linkedTwintag"."776ff32c7d7bfb1c1a49d5c0d94e0db8" 
WHERE 1 = 1  AND "linkedTwintag"."createdDate" = ?  AND "linkedTwintag"."qid" = ?  ORDER BY "75620d04a8b7b34c70c68f7af68953af"."qid" ASC"

Expected behavior

I get back a normal resultbatch.

Additional context

I've not used datafusion a lot, so I'm not 100% sure whether its a bug in the crate for pulling in mysql tables, or if its a bug in datafusion itself.

MidasLamb commented 1 year ago

I'm digging what deeper and it seems to be caused by the capitalization inside of "linkedTwintag", as putting some dbg! statements everywhere, I see that at some point it compares

Some(
        Bare {
            table: "linkedTwintag",
        },
    ),
    Some(
        Bare {
            table: "linkedtwintag",
        },
    ),

which results in it not being a match

alamb commented 1 year ago

I saw something similar the other day: https://github.com/apache/arrow-datafusion/issues/6790

i wonder if it is possible to create a datafusion only reproducer (with the datafusion-cli for example)