Performance against dask read_sql_table

blue-yonder / turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

http://turbodbc.readthedocs.io/en/latest

MIT License


Performance against dask read_sql_table #263

Open argenisleon opened 4 years ago

argenisleon commented 4 years ago

Hi,

Is there any benchmark against pandas or dask? I am thinking about using turbodbc in https://github.com/ironmussa/optimus to move data from databases to cudf and dask-cudf.

Any idea?

xhochy commented 4 years ago

Have a look at @MathMagique's presentation, starting at 20:00: http://2017.de.pycon.org/schedule/talks/turbodbc-turbocharged-database-access-for-data-scientists/

The performance of turbodbc's plain PEP-249 interface should be roughly similar to that of pandas.read_sql_table, and the talk shows how large the differences are for the other fetch methods. Going to cudf with the Turbodbc Arrow adapter might be even more efficient, as you should be able to avoid the round trip through pandas: cudf also uses Arrow as its memory layout.
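As a rough illustration of the Arrow route, here is a minimal sketch. It assumes turbodbc was built with Arrow support and that cudf is installed; the DSN `my_dsn` and the table `my_table` are placeholders, not names from this thread.

```python
def fetch_arrow_table(connection, query):
    """Run `query` on a turbodbc connection and return a pyarrow.Table."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # fetchallarrow() is only available if turbodbc was built with Arrow support.
        return cursor.fetchallarrow()
    finally:
        cursor.close()


def load_into_cudf(dsn="my_dsn", query="SELECT * FROM my_table"):
    """Hypothetical usage: the DSN and query defaults are placeholders."""
    # Imported lazily so the helper above can be used without these packages.
    import turbodbc
    import cudf

    connection = turbodbc.connect(dsn=dsn)
    try:
        table = fetch_arrow_table(connection, query)
    finally:
        connection.close()
    # Build the GPU DataFrame directly from Arrow, without going through pandas.
    return cudf.DataFrame.from_arrow(table)
```

Whether this is actually faster than going through pandas would need to be measured for the database and driver in question.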

dhirschfeld commented 4 years ago

I did a comparison against mssql/sqlalchemy, fetching 1e6 records from a SQL Server database and got a 6x speedup with turbodbc:

...and that includes the cost of converting to pandas. Plans are to try and avoid that overhead with fletcher.

xhochy commented 4 years ago

Thanks for doing this, @dhirschfeld!

argenisleon commented 4 years ago

Amazing talk, @MathMagique, and thanks for the info, @xhochy and @dhirschfeld.

Internally, dask uses the table index to parallelize the data reading. Any idea how this could play with turbodbc? Would there be any gain in using dask for this?
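To make the question concrete, here is a sketch of how the two might be combined. It is not something turbodbc or dask provides out of the box. It assumes an integer index column with known bounds, a turbodbc build with Arrow support, and trusted table and column names (they are interpolated directly into the SQL); all function names below are hypothetical.

```python
def partition_queries(table, index_col, lower, upper, npartitions):
    """Build one SELECT per contiguous range of an integer index column."""
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    total = upper - lower + 1            # bounds are inclusive
    step = -(-total // npartitions)      # ceiling division
    queries = []
    start = lower
    while start <= upper:
        end = min(start + step - 1, upper)
        queries.append(
            f"SELECT * FROM {table} WHERE {index_col} BETWEEN {start} AND {end}"
        )
        start = end + 1
    return queries


def read_partition(dsn, query):
    """Fetch one partition with turbodbc and return it as a pandas DataFrame."""
    import turbodbc  # imported lazily; each task opens its own connection

    connection = turbodbc.connect(dsn=dsn)
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchallarrow().to_pandas()
    finally:
        connection.close()


def read_table_with_dask(dsn, table, index_col, lower, upper, npartitions):
    """Assemble a lazy dask DataFrame with one delayed fetch per index range."""
    import dask
    import dask.dataframe as dd

    parts = [
        dask.delayed(read_partition)(dsn, query)
        for query in partition_queries(table, index_col, lower, upper, npartitions)
    ]
    return dd.from_delayed(parts)
```

For example, `partition_queries("t", "id", 1, 10, 3)` yields three queries covering ids 1 to 4, 5 to 8, and 9 to 10. Whether the parallel fetches give any gain would depend on how well the database serves concurrent range queries.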