INWTlab / dbrequests

Python package built for easy use of raw SQL within Python and pandas projects
MIT License

dbrequests.mysql: Performance of reading data from DB #22

Closed: wahani closed this issue 4 years ago

wahani commented 4 years ago

Current issue

With the recent release of 1.3.9 we can now send data with far more than 20 million rows to a DB. Even hundreds of millions of rows are possible. However, reading the same data (20 million rows x 5 cols, to have some reference) back in is not possible on the same machine. Before I introduced the datatable package, sending that amount of data to a DB needed 8 GB of RAM. The 16 GB of RAM on my machine are not sufficient to read the same amount back into Python.

Currently in use for fetching and querying data are sqlalchemy, pymysql and pandas. Every one of them adds some form of inefficiency to the problem. I experimented quite a lot with different scenarios. Given that pulling the 20 million rows takes ~50 s in R (all local, on a laptop), without any special settings and without any memory problems, I was surprised to see that the same task is not even feasible in Python by default.
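For reference, the current read path boils down to something like the following sketch; the connection string and table name are placeholders, not taken from the package:

```python
# Rough sketch of the current read path (sqlalchemy + pymysql + pandas).
# Connection details and table name are placeholders. All rows are first
# materialised as Python objects before pandas builds the DataFrame, which
# is where memory consumption explodes for tens of millions of rows.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/some_db")
df = pd.read_sql("SELECT * FROM some_table", con=engine)
```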

Solution

Unfortunately the solution is to avoid all three packages, sqlalchemy, pymysql and pandas, and instead use datatable and mysqlclient. To have a compelling point here: we can reduce the time needed to 20 s in Python with 1.3 GB of memory (both better than what we have in dbtools). Maybe even more would be possible, but this is already a substantial improvement over the current situation.
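To illustrate the direction, here is a minimal sketch of reading via mysqlclient and datatable. This is not the implementation proposed for the package; connection details, chunk size and table name are made up:

```python
# Sketch only: stream rows from MySQL with mysqlclient's server-side cursor
# and collect them into a datatable Frame, avoiding sqlalchemy/pymysql/pandas.
import MySQLdb
import MySQLdb.cursors
import datatable as dt

connection = MySQLdb.connect(
    host="localhost", user="user", passwd="password", db="some_db",
    cursorclass=MySQLdb.cursors.SSCursor)  # server-side cursor: rows are streamed

try:
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_table")
    col_names = [desc[0] for desc in cursor.description]

    frames = []
    while True:
        rows = cursor.fetchmany(1_000_000)  # fetch in chunks to bound memory
        if not rows:
            break
        # zip(*rows) transposes row tuples into columns, which is the layout
        # datatable.Frame expects when given a dict of column data.
        columns = {name: list(col) for name, col in zip(col_names, zip(*rows))}
        frames.append(dt.Frame(columns))
    result = dt.rbind(*frames) if frames else dt.Frame()
finally:
    connection.close()

df = result.to_pandas()  # convert only if a pandas DataFrame is really needed
```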

20 million rows is just the size I used for experimenting. The actual amount for which we need a solution will be between 100 and 200 million rows. So all of these points are relevant to me.

Strategy

Pros

Cons

wahani commented 4 years ago

Cons

  • We break the contract in the API introduced by sqlalchemy. I found a way to keep the changes as minimally invasive to the base class as possible.
    • However, providing a URL won't work anymore; that is a sqlalchemy concept.
    • Everything else provided by sqlalchemy can be preserved / covered.
  • We break the contract introduced by pandas.
    • Arguments passed to the read_sql function cannot remain the same. Instead they are now passed down to the mysqlclient.connection.cursor.execute function.
    • The return value of send_query is then a datatable.Frame.
    • At least that object provides a to_pandas method, so it is easily fixed (see the usage sketch after this list).
  • mysqlclient requires additional system dependencies. This is a serious limitation for anyone planning to use this package on Windows. In that case one has to fall back to the base class and work with the default drivers and pymysql; but the mysql module of the package is Linux only.
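For completeness, here is a hedged sketch of how the changed contract could look for a user. The Database class name, the credential dict and the close call are assumptions for illustration; only send_query and to_pandas are taken from the points above:

```python
# Hypothetical usage of the mysql module after the change; class name,
# credential handling and close() are assumptions, not the confirmed API.
from dbrequests.mysql import Database

creds = {
    "user": "user",
    "password": "password",
    "host": "localhost",
    "db": "some_db",
}  # a sqlalchemy-style URL string is no longer supported

db = Database(creds)
frame = db.send_query("SELECT * FROM some_table")  # now returns a datatable.Frame
df = frame.to_pandas()  # convert only where a pandas DataFrame is really needed
db.close()
```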

I tried to address these cons in #23 and think almost all of them are taken care of. The one exception:

phainom commented 4 years ago

I still have to look at #23, and I know you managed to work around the main cons here, but let me just add that a lot of people try to avoid pandas in large-scale production pipelines. One of the biggest reasons is the memory constraint that comes from pandas loading everything into memory by design.

The merging of boilerplating around sqlalchemy with pandas and its to_sql functionality is the core feature of dbrequests, though. From my perspective, if a database connector without the constraints of pandas is needed because projects using SQL in python without pandas are conducted by INWT, a separate package designed for this would make more sense. dbrequests was mainly designed for usage in a project heavily relying on pandas (I dont know if that architectural decision was reverted or not).