dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License
[BUG] dask-sql-server + Superset not working for GPU tables #310

Open charlesbluca opened 2 years ago

charlesbluca commented 2 years ago

What happened: It is relatively trivial to connect dask-sql's Presto server to Apache Superset for some basic visualization. Queries against pandas-backed tables work fine, but any query against a cuDF-backed table fails with a TypeError:

Traceback (most recent call last):
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/raid/charlesb/dev/rapids/compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/fastapi/routing.py", line 159, in run_endpoint_function
    return await dependant.call(**values)
  File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/app.py", line 61, in status
    return DataResults(df, request=request)
  File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/responses.py", line 115, in __init__
    self.data = self.get_data_description(df)
  File "/raid/charlesb/dev/rapids/dask-sql/dask_sql/server/responses.py", line 82, in get_data_description
    for row in df.itertuples(index=False, name=None)
  File "/raid/charlesb/dev/rapids/cudf/python/cudf/cudf/core/dataframe.py", line 6298, in itertuples
    raise TypeError(
TypeError: cuDF does not support iteration of DataFrame via itertuples. Consider using `.to_pandas().itertuples()` if you wish to iterate over namedtuples.

What you expected to happen: I would expect to be able to use cuDF-backed tables in Superset the same as Pandas-backed tables.

Minimal Complete Verifiable Example: To set up a minimal Superset environment to run queries from:

# mostly lifted from https://superset.apache.org/docs/installation/installing-superset-from-scratch

mamba create -n superset python=3.8 pyhive
mamba activate superset
pip install apache-superset

superset db upgrade

export FLASK_APP=superset
superset fab create-admin

superset init

superset run -p 8088 --with-threads --reload --debugger

From there we can create a trivial dask-cuDF dataframe and serve it via Presto:

import dask_cudf

from dask.datasets import timeseries
from dask_sql import Context

df = timeseries()
ddf = dask_cudf.from_dask_dataframe(df)

c = Context()

c.create_table("timeseries", ddf)
c.run_server()

Connecting to this database and running any basic queries will result in the above traceback.

Anything else we need to know?: This looks like a fairly trivial failure, which isn't surprising since we haven't done much GPU work on the server side. The simplest fix would be to conditionally call to_pandas on the dataframe before calling itertuples, though I suspect there are better ways to work around this.
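
For illustration, the conditional conversion could look roughly like the sketch below. The helper names to_host_frame and iter_rows are hypothetical and not part of dask-sql. The sketch assumes that GPU-backed frames (such as cudf.DataFrame) expose a to_pandas() method while pandas frames do not, so a duck-typed check is enough and cudf never has to be imported on CPU-only installs.

```python
def to_host_frame(df):
    """Return a frame that supports itertuples, converting only if needed.

    Assumption: GPU-backed frames expose ``to_pandas()``; pandas frames
    do not, so they are returned unchanged.
    """
    to_pandas = getattr(df, "to_pandas", None)
    if callable(to_pandas):
        # This copies the data from device to host memory.
        return to_pandas()
    return df


def iter_rows(df):
    """Iterate over rows as plain tuples, as the traceback's call site does."""
    host_df = to_host_frame(df)
    return host_df.itertuples(index=False, name=None)
```

The conversion forces a device-to-host copy of the whole result, so this is only reasonable for the small result sets a dashboard typically requests.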

cc @bryevdv @randerzander

charlesbluca commented 2 years ago

#312 addresses this, but it isn't an ideal solution IMO because of the forced host synchronization; it may be worthwhile to explore other SQL server options whose response types don't require us to iterate over table rows.