INWTlab / dbrequests

A Python package built for easy use of raw SQL within Python and pandas projects
MIT License

Slow and Memory-Hungry send_data for mysql Module #20

Closed wahani closed 4 years ago

wahani commented 4 years ago

In the dialect-specific implementation for MySQL, we gain some performance by using the LOAD DATA LOCAL INFILE strategy. This already sped up the runtime by a factor of 10 in some of my tests. However, with medium-sized data (>20 million rows) the memory footprint of pandas is a real problem: to write a dataset that occupies 2.4 GB in RAM, we need roughly 8 GB just to dump it to disk. As in R, we have an alternative that is optimized for speed: datatable. With datatable, the same data uses only 800 MB of RAM, and no additional RAM is needed to write it to disk. It is also much faster at writing the data to CSV (which is what we do in the mysql module).
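For context, here is a minimal sketch of the LOAD DATA LOCAL INFILE strategy, assuming a pymysql connection; the function name and the SQL details are illustrative, and the actual dbrequests implementation may differ:

```python
import tempfile

import pandas as pd
import pymysql


def send_data_infile(df: pd.DataFrame, table: str, **connect_kwargs) -> None:
    """Dump df to a temporary CSV and bulk-load it into `table`."""
    # local_infile=True is required for LOAD DATA LOCAL INFILE to work.
    conn = pymysql.connect(local_infile=True, **connect_kwargs)
    try:
        with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
            # pandas' to_csv is where the extra RAM described above is
            # spent; the bulk load itself is cheap for the client.
            df.to_csv(tmp.name, index=False)
            with conn.cursor() as cur:
                cur.execute(
                    f"LOAD DATA LOCAL INFILE '{tmp.name}' "
                    f"INTO TABLE {table} "
                    "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
                )
        conn.commit()
    finally:
        conn.close()
```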

So the idea is simple: we maintain the API of the send_data method and replace the backend to use datatable. A PR will follow shortly.
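A hedged sketch of what that backend swap could look like: the public send_data signature stays, and only the CSV-writing step changes. The helper name below is illustrative, not the actual dbrequests API:

```python
import datatable as dt
import pandas as pd


def write_csv(df: pd.DataFrame, path: str) -> None:
    # Converting to a datatable Frame and using its multi-threaded
    # to_csv avoids pandas' write-time memory overhead.
    dt.Frame(df).to_csv(path)
```

In the sketch above, send_data_infile would call write_csv instead of df.to_csv, leaving the rest of the LOAD DATA LOCAL INFILE flow untouched.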

A caveat is that it is not quite clear how stable the API is yet. We have to keep an eye on that, but right now I am happy to accept this and have a feasible way to write moderately large data to a database with appropriate hardware resources...