What happened:
It is relatively trivial to connect dask-sql's Presto server to Apache Superset for some basic visualization. Queries work fine against Pandas-backed tables, but any query against a cuDF-backed table fails with a TypeError:
Traceback (most recent call last):
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
return await self.app(scope, receive, send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
await super().__call__(scope, receive, send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
await self.middleware_stack(scope, receive, send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
raise exc
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
await self.app(scope, receive, _send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
raise exc
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
await self.app(scope, receive, sender)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
await route.handle(scope, receive, send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
await self.app(scope, receive, send)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
response = await func(request)
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
raw_response = await run_endpoint_function(
File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/routing.py", line 159, in run_endpoint_function
return await dependant.call(**values)
File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/app.py", line 61, in status
return DataResults(df, request=request)
File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/responses.py", line 115, in __init__
self.data = self.get_data_description(df)
File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/responses.py", line 82, in get_data_description
for row in df.itertuples(index=False, name=None)
File "/raid/charlesb/dev/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 6298, in itertuples
raise TypeError(
TypeError: cuDF does not support iteration of DataFrame via itertuples. Consider using `.to_pandas().itertuples()` if you wish to iterate over namedtuples.
What you expected to happen:
I would expect to be able to use cuDF-backed tables in Superset the same as Pandas-backed tables.
Minimal Complete Verifiable Example:
To set up a minimal Superset environment to run queries from:
# mostly lifted from https://superset.apache.org/docs/installation/installing-superset-from-scratch
mamba create -n superset python=3.8 pyhive
mamba activate superset
pip install apache-superset
superset db upgrade
export FLASK_APP=superset
superset fab create-admin
superset init
superset run -p 8088 --with-threads --reload --debugger
From there we can create a trivial dask-cuDF dataframe and serve it via Presto:
import dask_cudf
from dask.datasets import timeseries
from dask_sql import Context
df = timeseries()
ddf = dask_cudf.from_dask_dataframe(df)
c = Context()
c.create_table("timeseries", ddf)
c.run_server()
Connecting to this database from Superset and running any basic query results in the traceback above.
Anything else we need to know?:
This looks like a pretty trivial failure, which isn't surprising since we haven't done much GPU work on the server end of things. The trivial solution here would be to conditionally call to_pandas on the dataframe before attempting itertuples, though I feel like there could be better ways to work around this.
#312 addresses this, but it isn't an ideal solution IMO because of the forced host synchronization; it may be worthwhile to explore different SQL server options that allow for response types that don't require us to iterate over table rows.
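For reference, the conditional to_pandas conversion mentioned above could look roughly like the sketch below. The helper name rows_as_tuples is hypothetical and is not taken from dask-sql or from #312. It assumes the result frame is either a Pandas DataFrame or a cuDF DataFrame, and that only the cuDF one exposes to_pandas. Note that this still copies the whole result to host memory, which is the synchronization cost described above.

```python
def rows_as_tuples(df):
    """Return the rows of a result frame as a list of plain tuples.

    Hypothetical helper, not part of dask-sql. Assumes `df` is either a
    pandas DataFrame or a cuDF DataFrame.
    """
    # cuDF DataFrames provide to_pandas(); pandas DataFrames do not, so the
    # attribute check avoids importing cudf on CPU-only installations.
    if hasattr(df, "to_pandas"):
        # Device-to-host copy: this forces a synchronization with the GPU.
        df = df.to_pandas()

    # Same call that get_data_description makes in the traceback above.
    return list(df.itertuples(index=False, name=None))
```

Checking for the attribute rather than the type keeps cuDF an optional dependency of the server, at the cost of also converting any other frame type that happens to define to_pandas.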
cc @bryevdv @randerzander