Open argenisleon opened 4 years ago
Have a look at @MathMagique's presentation, starting at 20:00: http://2017.de.pycon.org/schedule/talks/turbodbc-turbocharged-database-access-for-data-scientists/
The PEP-249 performance should be roughly similar to the pandas.read_sql_table
performance. There you can see what the performance differences are. Going to cudf
with the Turbodbc Arrow adapter might be even more efficient, as you should be able to avoid the roundtrip through pandas,
since cudf also uses Arrow as its memory layout.
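A minimal sketch of that path, in case it helps: fetch the result set as an Arrow table with turbodbc and hand it straight to cudf, skipping pandas. It assumes turbodbc was built with Arrow support (so `cursor.fetchallarrow()` is available) and that cudf is installed on a machine with a GPU. The helper names `fetch_arrow` and `query_to_cudf`, the `from_arrow` parameter, and the DSN and table name in the usage comment are hypothetical, made up for illustration.

```python
def fetch_arrow(connection, query):
    """Run `query` on a turbodbc connection and return the result as a pyarrow.Table."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # fetchallarrow() needs a turbodbc build with Arrow support.
        return cursor.fetchallarrow()
    finally:
        cursor.close()


def query_to_cudf(connection, query, from_arrow=None):
    """Fetch `query` as Arrow and convert it to a cudf DataFrame, without going through pandas.

    `from_arrow` is a hypothetical hook for swapping the converter (e.g. in tests);
    by default cudf.DataFrame.from_arrow is used.
    """
    if from_arrow is None:
        import cudf  # imported lazily: requires a CUDA-capable GPU

        from_arrow = cudf.DataFrame.from_arrow
    return from_arrow(fetch_arrow(connection, query))


# Usage (assumed DSN and table name):
#   from turbodbc import connect
#   conn = connect(dsn="my_dsn")
#   gdf = query_to_cudf(conn, "SELECT * FROM my_table")
```

I have not benchmarked this against the pandas route, so treat any speedup as something to measure rather than a given.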
I did a comparison against mssql/sqlalchemy, fetching 1e6 records from a SQL Server database, and got a 6x speedup with turbodbc, and that includes the cost of converting to pandas. Plans are to try and avoid that overhead with fletcher.
Thanks for doing this @dhirschfeld!
Amazing talk @MathMagique, and thanks for the info @xhochy @dhirschfeld.
Internally dask uses the table index to parallelize the data reading. Any idea on how this could play with turbodbc? Could there be any gain in using dask for this?
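One way this might be combined, as a rough sketch only: split an integer index range into chunks, let each dask task open its own turbodbc connection and fetch one chunk, and assemble the pieces with `dask.dataframe.from_delayed`. This is not how dask's own SQL reader is implemented. The function names, the DSN, and the assumption of a reasonably dense integer index column are all mine, and `table` and `index_col` are interpolated into the SQL as trusted identifiers.

```python
def index_ranges(lo, hi, npartitions):
    """Split the inclusive integer interval [lo, hi] into contiguous half-open (start, stop) ranges."""
    step = -(-(hi - lo + 1) // npartitions)  # ceiling division
    return [(start, min(start + step, hi + 1)) for start in range(lo, hi + 1, step)]


def fetch_range(dsn, table, index_col, start, stop):
    """Fetch rows with start <= index_col < stop via turbodbc and return a pandas DataFrame."""
    from turbodbc import connect  # imported lazily so each worker opens its own connection

    connection = connect(dsn=dsn)
    try:
        cursor = connection.cursor()
        # turbodbc uses qmark-style placeholders for parameters.
        cursor.execute(
            f"SELECT * FROM {table} WHERE {index_col} >= ? AND {index_col} < ?",
            [start, stop],
        )
        return cursor.fetchallarrow().to_pandas()
    finally:
        connection.close()


def read_table(dsn, table, index_col, lo, hi, npartitions):
    """Build a dask DataFrame with one delayed turbodbc fetch per index range."""
    import dask
    import dask.dataframe as dd

    parts = [
        dask.delayed(fetch_range)(dsn, table, index_col, start, stop)
        for start, stop in index_ranges(lo, hi, npartitions)
    ]
    # Without an explicit `meta`, dask computes one partition to infer dtypes.
    return dd.from_delayed(parts)


# Usage (assumed DSN, table, and index bounds):
#   ddf = read_table("my_dsn", "my_table", "id", lo=1, hi=1_000_000, npartitions=8)
```

Whether this gains anything likely depends on whether the database or the client-side conversion is the bottleneck, so it would need measuring.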
Hi,
Is there any benchmark against pandas or dask? I am thinking about using turbodbc in https://github.com/ironmussa/optimus to move data from databases to cudf and dask-cudf.
Any ideas?