INWTlab / dbrequests

Python package built for easy use of raw SQL within Python and pandas projects
MIT License

dbrequests.mysql: Performance of reading data from DB #22

Closed: wahani closed this issue 4 years ago

wahani commented 4 years ago

Current issue

With the recent release of 1.3.9 we can now send data with far more than 20 million rows to a DB. Even hundreds of millions of rows are possible. However, reading the same data (20 million rows x 5 cols, to have some reference) back in is not possible on the same machine. Before I introduced the datatable package, sending that amount of data to a DB needed 8 GB of RAM. The 16 GB of RAM on my machine are not sufficient to read the same amount back into Python.

Currently in use for fetching and querying data are sqlalchemy, pymysql and pandas. Every one of them adds some form of inefficiency to the problem. I experimented quite a lot with different scenarios. Given that pulling the 20 million rows takes ~50 s in R (all local, on a laptop), without any special settings and without any memory problems, I was surprised to see that the same task is not even feasible in Python by default.
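For reference, the current read path boils down to something like the following sketch; the connection string and table name are placeholders, not taken from the package:

```python
# Rough sketch of the current read path (sqlalchemy + pymysql + pandas).
# Connection details and table name are placeholders. All rows are first
# materialised as Python objects before pandas builds the DataFrame, which
# is where memory consumption explodes for tens of millions of rows.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/some_db")
df = pd.read_sql("SELECT * FROM some_table", con=engine)
```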

Solution

Unfortunately the solution is to avoid all three packages, sqlalchemy, pymysql and pandas, and instead use datatable and mysqlclient. To have a compelling point here: we can reduce the time needed to 20 s in Python with 1.3 GB of memory (both better than what we have in dbtools). Maybe even more would be possible, but this is already a substantial improvement over the current situation.
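To illustrate the direction, here is a minimal sketch of reading via mysqlclient and datatable. This is not the implementation proposed for the package; connection details, chunk size and table name are made up:

```python
# Sketch only: stream rows from MySQL with mysqlclient's server-side cursor
# and collect them into a datatable Frame, avoiding sqlalchemy/pymysql/pandas.
import MySQLdb
import MySQLdb.cursors
import datatable as dt

connection = MySQLdb.connect(
    host="localhost", user="user", passwd="password", db="some_db",
    cursorclass=MySQLdb.cursors.SSCursor)  # server-side cursor: rows are streamed

try:
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_table")
    col_names = [desc[0] for desc in cursor.description]

    frames = []
    while True:
        rows = cursor.fetchmany(1_000_000)  # fetch in chunks to bound memory
        if not rows:
            break
        # zip(*rows) transposes row tuples into columns, which is the layout
        # datatable.Frame expects when given a dict of column data.
        columns = {name: list(col) for name, col in zip(col_names, zip(*rows))}
        frames.append(dt.Frame(columns))
    result = dt.rbind(*frames) if frames else dt.Frame()
finally:
    connection.close()

df = result.to_pandas()  # convert only if a pandas DataFrame is really needed
```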

20 million rows is just the size I used for experimenting. The actual amount for which we need a solution will be between 100 and 200 million rows. So all of these points are relevant to me.

Strategy

Pros

Cons

wahani commented 4 years ago

Cons

  • We break the contract in the API introduced by sqlalchemy. I found a way to keep the changes as minimally invasive to the base class as possible.
    • However, providing a URL won't work anymore; that is a sqlalchemy concept.
    • Everything else provided by sqlalchemy can be preserved / covered.
  • We break the contract introduced by pandas.
    • Arguments passed to the read_sql function cannot remain the same. Instead they are now passed down to the mysqlclient.connection.cursor.execute function.
    • The return value of send_query is then a datatable.Frame.
    • At least that object provides a to_pandas method, so it is easily fixed (see the usage sketch after this list).
  • mysqlclient requires additional system dependencies. This is a serious limitation for anyone planning to use this package on Windows. In that case one has to fall back to the base class and work with the default drivers and pymysql; but the mysql module of the package is Linux only.
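For completeness, here is a hedged sketch of how the changed contract could look for a user. The Database class name, the credential dict and the close call are assumptions for illustration; only send_query and to_pandas are taken from the points above:

```python
# Hypothetical usage of the mysql module after the change; class name,
# credential handling and close() are assumptions, not the confirmed API.
from dbrequests.mysql import Database

creds = {
    "user": "user",
    "password": "password",
    "host": "localhost",
    "db": "some_db",
}  # a sqlalchemy-style URL string is no longer supported

db = Database(creds)
frame = db.send_query("SELECT * FROM some_table")  # now returns a datatable.Frame
df = frame.to_pandas()  # convert only where a pandas DataFrame is really needed
db.close()
```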

I tried to address these cons in #23 and think almost all of them are taken care of. The one exception:

phainom commented 4 years ago

I still have to look at #23, and I know you managed to work around the main cons here, but let me just add that a lot of people try to avoid pandas in large-scale production pipelines. One of the biggest reasons is the memory constraint that comes from pandas loading everything into memory by design.

The merging of boilerplating around sqlalchemy with pandas and its to_sql functionality is the core feature of dbrequests, though. From my perspective, if a database connector without the constraints of pandas is needed because projects using SQL in python without pandas are conducted by INWT, a separate package designed for this would make more sense. dbrequests was mainly designed for usage in a project heavily relying on pandas (I dont know if that architectural decision was reverted or not).