apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.95k stars 979 forks source link

Query failed when EXISTS used with UNNEST #2936

Closed egasimov closed 2 months ago

egasimov commented 2 months ago

Hello Drill community,

Recently, we have discovered apache drill and started to make some experiments on reading files and run the filter kinda queries over dataset.

So, my scenario is as follows:

I have a parquet file whose contents as follows:(customers with the purchased items) When to try to find the customers who purchased the following items whose product_id is from the given list (777,888)

The following query sent to drill server:

{
    "queryType": "SQL",
    "query": "SELECT *  FROM dfs.root.`/datas3/customers/*` d WHERE EXISTS ( SELECT 1 FROM UNNEST(d.purchased_items) AS s WHERE  s.item_id in (777))"
}

Getting the following error: VALIDATION ERROR: From line 1, column 118 to line 1, column 124: Column 'item_id' not found in table 's'

Additional Context

[

{
    "customer_id": 1000001,
    "customer_name": "John Doe",
    "purchased_items" : [
        {
            "item_id": 2000001,
            "item_class": "A",
            "product_id" : 777,
            "created_at": "2024-06-12T11:36:37.751Z"
        },
        {
            "item_id":2000002,
            "item_class": "B",
            "product_id" : 888,
            "created_at": "2024-06-12T11:46:37.751Z"
        },        
        {
            "item_id":2000003,
            "item_class": "C",
            "product_id" : 999,
            "created_at": "2024-06-12T11:46:37.751Z"
        }
    ]
}

]
egasimov commented 2 months ago

After analyzing the source code for the UNNEST operator, we have observed the following pattern need to be used with the UNNEST operator.

{
    "queryType": "SQL",
    "query": "SELECT d.customer_id, d.purchased_items  FROM dfs.root.`/datas3/customers/*` d WHERE EXISTS ( SELECT 1 FROM UNNEST(d.purchased_items) t2(ord) WHERE  t2.ord.item_id in (2000001))"
}
cgivre commented 2 months ago

@egasimov You beat me to it, but I don't think UNNEST is the operator you want. UNNEST performs a LATERAL JOIN automatically, even when not specified. I think FLATTEN is probably the operator you want to use.

Also, I'd recommend NOT using dot notation to reference inner fields in maps. IE:

Don't use

SELECT s.id

Instead use

SELECT s['id']
egasimov commented 2 months ago

Hey @cgivre , Thank you for response. I have really appreciated it.

Requirement for the SQL query is to check and filter the those rows whose nested array field(_purchaseditems) have at least one of the values provided in the IN clause and return the original rows(without duplicating the entries).

When FLATTEN used in the query, after the filter operations, we will again need to GROUP BY or DISTINCT to remove duplicate rows form result.

Wdyt, Is there any other way to accomplish the same goal but in optimized way ? :thinking: