MariaDB / mariadb_kernel

A MariaDB Jupyter kernel
BSD 3-Clause "New" or "Revised" License

Make kernel capable of dealing with huge SELECTs #9

Open robertbindar opened 3 years ago

robertbindar commented 3 years ago

If you SELECT a huge number of rows, say 500k or 1M, it's probably not a good idea to create such a huge DataFrame. I tested with 500k rows, and the DataFrame itself is not exactly the problem (it consumed around 500MB of memory); the real problem is the number of rows the browser needs to render. My Brave tab was taking 2GB to render 500k rows, so the Lab UI slows down considerably. I've also observed during tests that in some runs pexpect times out while waiting for data from the MariaDB client (in this case the timeout settings are too low).
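For reference, a quick way to confirm the DataFrame-side cost is `memory_usage(deep=True)`. The column layout below is made up for illustration; real result sets with several string columns cost far more per row:

```python
import pandas as pd

# Illustrative only: measure how much memory a large result-set DataFrame
# actually holds. String columns dominate the footprint in practice.
df = pd.DataFrame({
    "id": range(500_000),
    "name": ["some_row_value"] * 500_000,
})
print(f"{df.memory_usage(deep=True).sum() / 1e6:.0f} MB")
```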

A potential solution could be to introduce a new config option that specifies a row limit for each SELECT statement, with a default value of something like 50k rows. There should also be a magic command that issues the SELECT and writes the output directly to disk, so users can still chart large datasets.
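A rough sketch of how such a limit could be enforced. The helper name, the default, and the regex check are all hypothetical; a real implementation would need proper SQL parsing to handle subqueries and nested LIMIT clauses correctly:

```python
import re

# Hypothetical sketch of the proposed row-limit option: cap SELECTs that
# carry no LIMIT clause of their own. The regex check is deliberately
# naive and would misfire on LIMITs inside subqueries.
DEFAULT_ROW_LIMIT = 50_000  # the default floated above

def apply_row_limit(statement: str, limit: int = DEFAULT_ROW_LIMIT) -> str:
    stripped = statement.rstrip().rstrip(";")
    is_select = stripped.lstrip().upper().startswith("SELECT")
    has_limit = re.search(r"\bLIMIT\s+\d+", stripped, re.IGNORECASE)
    if is_select and not has_limit:
        return f"{stripped} LIMIT {limit};"
    return statement
```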

The tricky part is to make the charting magic commands work efficiently if a large result set is written on disk.
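One possible approach, sketched here with pandas' chunked CSV reader (the file format, column names, and downsampling strategy are all assumptions, not the kernel's actual design): stream the on-disk result set in chunks and reduce it before handing anything to the plotting code, so the full result set is never materialised in memory.

```python
import pandas as pd

# Sketch: stream an on-disk result set in chunks and downsample it before
# plotting, keeping only the two columns the chart needs.
def downsample_from_disk(path, x_col, y_col, every_nth=100, chunksize=50_000):
    pieces = []
    for chunk in pd.read_csv(path, usecols=[x_col, y_col], chunksize=chunksize):
        pieces.append(chunk.iloc[::every_nth])  # keep every Nth row
    return pd.concat(pieces, ignore_index=True)
```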

KiaraGrouwstra commented 1 year ago

I ran into the timeout as well. It looks like pexpect defaults to timeout=30 both on its run functions and on its spawn constructors. A good first step would probably be to actually pass our own timeout values (which on our end fortunately default to -1) through to pexpect.
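A minimal sketch of what that plumbing could look like, assuming a kernel-side config value where -1 means "no limit" (the function name and the config handling are hypothetical, not the kernel's actual API):

```python
import pexpect

# Hypothetical sketch: forward the kernel's configured timeout to pexpect
# instead of silently inheriting pexpect's own timeout=30 default.
def spawn_mariadb_client(cmd, args, config_timeout=-1):
    # Assumption: in our config, -1 means "no limit"; pexpect expresses
    # that as timeout=None (block until output arrives).
    timeout = None if config_timeout == -1 else config_timeout
    return pexpect.spawn(cmd, args=args, timeout=timeout, encoding="utf-8")

# Usage: expect() calls inherit the spawn's timeout unless overridden.
# client = spawn_mariadb_client("mariadb", ["-u", "root"], config_timeout=120)
# client.expect(r"MariaDB \[.*\]>")
```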