dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

[ENH] Provide visibility into persisted tables and simple "restart" option #399

Open randerzander opened 2 years ago

randerzander commented 2 years ago

When running a number of SQL scripts in a session, it's not always clear whether every script "cleans up" after itself (dropping temp tables, unpersisting tables, de-registering UDFs, etc.).

It would be useful for Dask-SQL to support features that (a rough sketch of a possible API follows the list):

  1. List all objects persisted to the underlying Dask cluster
  2. Unpersist all persisted tables (perhaps also with schema level granularity)
  3. Reset the Context to a clean state (all tables, UDFs, and schemas dropped; any objects persisted by Dask-SQL unpersisted)
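
For illustration, here is a rough sketch of what such an API might look like. The method names (list_persisted, unpersist_all, reset) are hypothetical and do not exist in dask-sql today:

import dask_sql

c = dask_sql.Context()

# Hypothetical: enumerate every object dask-sql has persisted on the cluster
for name in c.list_persisted():
    print(name)

# Hypothetical: release all persisted tables, optionally per schema
c.unpersist_all(schema_name="root")

# Hypothetical: drop all tables, UDFs, and schemas, and unpersist everything
c.reset()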

In benchmarking work, we attempted to drop all tables manually with:

# Iterate over a copy of the table names, since drop_table mutates the schema
for table in list(c.schema["root"].tables.keys()):
    c.drop_table(table)

but (at least with dask-cudf) this didn't appear to remove all references to the persisted DataFrames: memory use remained high, and only dropped after calling Dask's client.restart().
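
As a stopgap, one can combine the table drop with standard Dask client calls to force cleanup on the workers. A minimal sketch, assuming c is the Context above and a running dask.distributed cluster:

import gc
from dask.distributed import Client

client = Client()  # connect to the cluster; adjust for your deployment

# Drop every table registered in the "root" schema
for table in list(c.schema["root"].tables.keys()):
    c.drop_table(table)

# Dropping tables removes dask-sql's references, but worker memory may not
# fall until the Python garbage collector runs on each worker
client.run(gc.collect)

# Heavy-handed fallback: restart all workers, clearing every persisted object
client.restart()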

For (3), the user can always create a new Context, but it's not clear whether this incurs the overhead of restarting the underlying JVM (cc @jdye64 for comment).

jdye64 commented 2 years ago

@randerzander resetting the Context would not cause another JVM instance to be started. In fact, the JVM instance is created before a Context even exists: it is started as a side effect of the first import dask_sql call.

For reference:

>>> import dask_sql
Starting JVM from path /home/jdyer/anaconda3/envs/dask-sql/lib/server/libjvm.so...
...having started JVM
>>> c1 = dask_sql.Context()
>>> c2 = dask_sql.Context()
randerzander commented 2 years ago

the JVM instance is created before a Context even exists: it is started as a side effect of the first import dask_sql call

thanks for confirming