coiled / dask-snowflake

Dask integration for Snowflake
BSD 3-Clause "New" or "Revised" License

Some queries return JSONResultBatch result type, causing errors #21

Open noyoshi opened 2 years ago

noyoshi commented 2 years ago

It appears some queries return results as a JSONResultBatch, which causes errors because the connector assumes every result batch will be in Arrow format.

Specific error:

RuntimeError: Received unknown result batch type <class 'snowflake.connector.result_batch.JSONResultBatch'>

The query in this case was SHOW TABLES;. I'm not sure whether Snowflake documents which result types a query can return; if it does, perhaps we could add some validation to the connector to handle the different types (more than just Arrow)?

I validated that a normal query, SELECT * FROM TABLE LIMIT 100, returns the proper result type and works.
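For reference, a minimal sketch that reproduces this (connection details are placeholders):

from dask_snowflake import read_snowflake

connection_kwargs = {
    "user": "...",  # placeholder credentials
    "password": "...",
    "account": "...",
}

# Works: a plain SELECT comes back as ArrowResultBatch objects
ddf = read_snowflake(
    "SELECT * FROM TABLE LIMIT 100",
    connection_kwargs=connection_kwargs,
)

# Fails with the RuntimeError above: SHOW TABLES comes back as
# JSONResultBatch objects
ddf = read_snowflake(
    "SHOW TABLES",
    connection_kwargs=connection_kwargs,
)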

jrbourbeau commented 2 years ago

Thanks @noyoshi for trying dask-snowflake out and raising an issue. Indeed, it appears Snowflake will return JSONResultBatch objects in some scenarios. Perhaps @sfc-gh-mkeller has thoughts on how we should handle both JSONResultBatch and ArrowResultBatch objects. For example, today we're using the ArrowResultBatch.to_pandas method to retrieve a pandas DataFrame for each underlying batch -- what is the recommended way of doing this for a JSONResultBatch?
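For context, here's a rough sketch of what we do today (simplified, not the actual dask-snowflake source; placeholder credentials):

import snowflake.connector
from snowflake.connector.result_batch import ArrowResultBatch

with snowflake.connector.connect(
    ...
) as cnx:
    with cnx.cursor() as cur:
        cur.execute("SELECT * FROM TABLE LIMIT 100")
        # Each batch is converted to pandas on a Dask worker --
        # which assumes every batch is Arrow-backed
        for batch in cur.get_result_batches():
            if not isinstance(batch, ArrowResultBatch):
                raise RuntimeError(
                    f"Received unknown result batch type {type(batch)}"
                )
            df = batch.to_pandas()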

sfc-gh-mkeller commented 2 years ago

This is caused by the fact that Snowflake cannot always provide results in Arrow format. Currently only SELECT statements can produce Arrow-formatted results, and that is the only format our connector supports converting into DataFrames. One workaround is to execute your query and then run a result scan on its results, like:

import snowflake.connector

with snowflake.connector.connect(
    ...
) as cnx:
    with cnx.cursor() as cur:
        # SHOW TABLES returns JSON-formatted result batches
        cur.execute("SHOW TABLES")
        # RESULT_SCAN re-reads the previous statement's output via a
        # SELECT, which Snowflake can serve in Arrow format
        cur.execute(f"select * from table(result_scan('{cur.sfqid}'));")
        print(cur.fetchall())

jrbourbeau commented 2 years ago

Thanks for the additional context @sfc-gh-mkeller -- I hadn't realized that only SELECT statements were supported for turning results into DataFrames.

I could be missing something, but it looks like the snippet you posted might cause the entire query result to be loaded into memory -- is that the case?

Is there a standard way to convert a JSONResultBatch into a JSON blob? I'm wondering if we could convert these result batches to JSON and then use pandas' read_json function to convert them into DataFrames.
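Something like this rough, untested sketch (it assumes a JSONResultBatch yields row tuples when iterated and exposes column metadata via a schema attribute):

import io
import json

import pandas as pd
from snowflake.connector.result_batch import JSONResultBatch

def json_batch_to_pandas(batch: JSONResultBatch) -> pd.DataFrame:
    # Assumed: column names come from the batch's schema metadata
    columns = [meta.name for meta in batch.schema]
    records = [dict(zip(columns, row)) for row in batch]
    # Round-trip through JSON and pandas' read_json, per the idea above;
    # default=str handles values (e.g. dates) that aren't JSON-native
    return pd.read_json(
        io.StringIO(json.dumps(records, default=str)),
        orient="records",
    )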

sfc-gh-mkeller commented 2 years ago

That's right, @jrbourbeau. The snippet was only meant to show how to work around this limitation -- I didn't mean for you to use it exactly that way, my apologies!

Not currently, unless you convert the Python results into DataFrames yourself. I have never used read_json, but if it's that simple then you could try patching the following function: https://github.com/snowflakedb/snowflake-connector-python/blob/ffdd6b3339aa71885878d047141fe9a77c4a4ae3/src/snowflake/connector/result_batch.py#L525
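For example, something along these lines might work as a stopgap (a rough, untested sketch; the schema attribute and row iteration are assumptions about our internals):

import pandas as pd
from snowflake.connector.result_batch import JSONResultBatch

def _json_to_pandas(self, **kwargs):
    # Build a DataFrame directly from the batch's rows, mirroring what
    # ArrowResultBatch.to_pandas returns for Arrow-backed batches
    columns = [meta.name for meta in self.schema]
    return pd.DataFrame(list(self), columns=columns)

# Monkeypatch so downstream code can call to_pandas on any batch type
JSONResultBatch.to_pandas = _json_to_pandas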

I'd imagine this might return different data types than our Arrow converter does, but if it works at all I'd be happy to have us take this over and add a way to do it natively in our connector.