JohnOmernik / sqlalchemy-drill

Apache Drill Dialect for SQL Alchemy

1.1.0 - Rewrite the Drill DB-API implementation using ijson #69

Closed jnturton closed 3 years ago

jnturton commented 3 years ago

This is basically a rewrite of _drilldbapi.py.

  1. Streaming query result parsing using ijson. Clients can start processing rows before the server has finished sending them and no longer need to buffer the entire dataset in memory.
  2. Remove pandas and NumPy, which were out of place in this layer.
  3. Lots of cleaning up.
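
To illustrate point 1, here is a minimal sketch of incremental row parsing with `ijson.items`. It is not the code from `_drilldbapi.py`: the `stream_rows` helper, the sample payload, and the assumed response shape (a top-level `rows` array of objects) are illustrative assumptions. The `json` fallback exists only so the sketch runs where ijson is not installed, and it does buffer the whole payload.

```python
import io
import json

try:
    import ijson  # third-party streaming JSON parser
except ImportError:  # keep the sketch runnable without ijson
    ijson = None


def stream_rows(fp):
    """Yield one row tuple at a time from a JSON response body.

    Assumes a payload shaped like {"rows": [{...}, {...}]}.
    """
    if ijson is not None:
        # "rows.item" selects each element of the top-level "rows" array,
        # so a row is produced as soon as its closing brace has been read.
        objects = ijson.items(fp, "rows.item")
    else:
        # Non-streaming fallback: parses the entire payload up front.
        objects = iter(json.load(fp)["rows"])
    for obj in objects:
        yield tuple(obj.values())


# Hypothetical payload standing in for an HTTP response body.
payload = io.BytesIO(
    b'{"queryId": "demo", "columns": ["id", "name"],'
    b' "rows": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}'
)
rows = stream_rows(payload)
first = next(rows)  # available before the rest of the array is parsed
```

Because `stream_rows` is a generator, a caller holds only the current row in memory rather than the whole result set.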

This pairs nicely with the streaming JSON serialisation in Drill 1.19. Combined, they should make fetching data from Drill into Python a good deal more efficient and reliable, at least for people who didn't go the JDBC or ODBC routes.

Log from a short session in which I run SELECT * on a remote table of 17bn records over a 0.5 Mbit/s link and start receiving a steady trickle of rows within seconds.

In [4]: r = engine.execute('select count(*) from dfs.ws.big_table')
INFO:drilldbapi:received Drill query ID 1f211888-cc20-fe6f-69d1-6584c5caa2df.
INFO:drilldbapi:opened a row data stream of 1 columns.

In [5]: next(r)
Out[5]: (17437571247,)

In [6]: r = engine.execute('select * from dfs.ws.big_table')
INFO:drilldbapi:received Drill query ID 1f211838-73df-1506-a74e-f5695f6b0ff5.
INFO:drilldbapi:opened a row data stream of 21 columns.

In [7]: while True:
   ...:     _ = next(r)
   ...:
INFO:drilldbapi:streamed 10000 rows.
INFO:drilldbapi:streamed 20000 rows.
INFO:drilldbapi:streamed 30000 rows.
INFO:drilldbapi:streamed 40000 rows.
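
The `while True: next(r)` loop above ends with `StopIteration` once the stream is exhausted. A `for` loop handles that implicitly. The sketch below shows the pattern against a stand-in generator rather than a live Drill connection; the `consume` helper and the logger name are hypothetical and not part of this package.

```python
import logging

logger = logging.getLogger("row_consumer")  # hypothetical logger name


def consume(rows, log_every=10000):
    """Iterate a lazy row source to exhaustion, logging progress."""
    count = 0
    for count, _row in enumerate(rows, start=1):
        if count % log_every == 0:
            logger.info("consumed %d rows.", count)
    return count


# Stand-in for a streamed result set: rows are generated on demand.
total = consume((i,) for i in range(25000))
```

Only one row is held at a time, so memory use stays flat however many rows the source produces.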