dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

Integrate with Google BigQuery to support large query results #188

Open NI1993 opened 3 years ago

NI1993 commented 3 years ago

I'm working on a project that involves querying a Google BigQuery server.

Depending on the SQL and the data it is run on, the resulting table may be too large to fit in memory.

It would be a very nice feature to use dask-sql to query a BigQuery server and get the resulting table as a Dask DataFrame.

I've managed to think of a workaround to fit a very large SQL result in a Dask DataFrame (see the sketch after this list):

  1. Prefix the SQL with an EXPORT DATA statement so BigQuery writes the result to Cloud Storage.

  2. Read the exported files from Cloud Storage using Dask.
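Roughly, something like this (a minimal sketch only; the bucket, project, dataset and table names are placeholders, and it assumes `google-cloud-bigquery` and `gcsfs` are installed):

```python
import dask.dataframe as dd
from google.cloud import bigquery

client = bigquery.Client()

# 1. Prefix the SQL with EXPORT DATA so BigQuery writes the result
#    to Cloud Storage instead of returning it directly.
export_sql = """
EXPORT DATA OPTIONS (
    uri = 'gs://my-bucket/results/*.parquet',
    format = 'PARQUET',
    overwrite = true
) AS
SELECT *
FROM `my-project.my_dataset.my_table`
"""
client.query(export_sql).result()  # wait for the export job to finish

# 2. Read the exported files lazily into a Dask DataFrame.
ddf = dd.read_parquet("gs://my-bucket/results/*.parquet")
```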

Let me know what you think :)

Orisbaum commented 3 years ago

GREAT IDEA!

nils-braun commented 3 years ago

Hi @NI1993 - interesting! Thanks for bringing this up. So just to understand you correctly: you would like to run the SQL on big query and end up with a dask dataframe, correct?

Dask-SQL helps you with performing SQL statements on already present Dask DataFrames, so getting data out of BigQuery would be a step before that (and for Dask-SQL it would not be different from reading from e.g. files).
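For context, this is roughly what Dask-SQL does today (table name and source path are placeholders):

```python
import dask.dataframe as dd
from dask_sql import Context

# ddf is any Dask DataFrame that already exists, however it was created
ddf = dd.read_parquet("gs://my-bucket/results/*.parquet")

c = Context()
c.create_table("my_table", ddf)   # register the existing dataframe under a table name
result = c.sql("SELECT col, COUNT(*) AS n FROM my_table GROUP BY col")
result.compute()                  # the result is again a (lazy) Dask DataFrame
```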

Did you have a look into https://github.com/dask/dask/issues/3121? Unfortunately, I think that issue has not been continued, but I would be super happy to build this together with your help. I think it might make sense to put it into a separate new package (dask-bigquery or something similar).

The reason I would propose not including it in Dask-SQL is that other users without dask-sql could then also create dataframes from BigQuery (and since we would pass the SQL on unchanged, we would not need to "understand" it).
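Just to illustrate the idea, such a reader could look roughly like the following. This is a purely hypothetical sketch, not an existing API of dask-sql or any dask-bigquery package: it materializes the query result in BigQuery and then reads row ranges of the destination table as partitions via dask.delayed.

```python
import dask
import dask.dataframe as dd
from google.cloud import bigquery


@dask.delayed
def _read_rows(table_ref, start, size):
    # Each partition reads one row range of the materialized result.
    client = bigquery.Client()
    rows = client.list_rows(table_ref, start_index=start, max_results=size)
    return rows.to_dataframe()


def read_bigquery(sql, chunk_size=1_000_000):
    """Hypothetical helper: run a query and return the result as a Dask DataFrame."""
    client = bigquery.Client()
    job = client.query(sql)
    job.result()                                   # wait until the destination table exists
    table = client.get_table(job.destination)

    parts = [
        _read_rows(job.destination, start, chunk_size)
        for start in range(0, table.num_rows, chunk_size)
    ]
    # Infer meta from a tiny sample (fine for a sketch, not for production).
    meta = client.list_rows(job.destination, max_results=5).to_dataframe().iloc[:0]
    return dd.from_delayed(parts, meta=meta)
```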

What do you think? I would be very happy to get your input!