COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Explorer tools performance #305

Open aidanheerdegen opened 2 years ago

aidanheerdegen commented 2 years ago

The time it takes to load the Database Explorer seems to have increased quite markedly recently.

import cosima_cookbook as cc
session = cc.database.create_session(db='/g/data/ik11/databases/cosima_master_2022-07-01.db')

%%time
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
CPU times: user 1min 16s, sys: 16 s, total: 1min 32s
Wall time: 4min 16s

Previous timings, from when the explorer was retooled because it was taking a minute to start up, suggest this is a recent, and significant, slowdown:

https://github.com/COSIMA/cosima-cookbook/pull/274#issuecomment-956851439

TL;DR it used to take 17s

angus-g commented 2 years ago

I don't know if it's a red herring, but I'm still getting 17s from the database on tmpfs (disk latency seems important...), after changing:

diff --git a/cosima_cookbook/querying.py b/cosima_cookbook/querying.py
index 8ff2e47..05cf3c1 100644
--- a/cosima_cookbook/querying.py
+++ b/cosima_cookbook/querying.py
@@ -210,7 +210,7 @@ def get_variables(
         q = q.outerjoin(subq, subq.c.ncvar_id == NCVar.id)

     q = q.order_by(NCFile.frequency, CFVariable.name, NCFile.time_start, NCFile.ncfile)
-    q = q.group_by(CFVariable, NCFile.frequency)
+    q = q.group_by(CFVariable.id, NCFile.frequency)

     if experiment is not None:
         q = q.group_by(subq.c.value)
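For anyone curious why that one-line change matters: in SQLAlchemy, `group_by(CFVariable)` expands to grouping by *every* mapped column of the entity, whereas `group_by(CFVariable.id)` groups only by the primary key. Since the primary key already identifies each row uniquely, the groups are identical, but the narrower key is much cheaper for SQLite to sort and compare. A minimal sketch with the stdlib `sqlite3` module (toy tables, not the cookbook's real schema) showing the two GROUP BY forms produce the same result:

```python
import sqlite3

# Toy schema for illustration only -- not the cosima-cookbook schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cfvariable (id INTEGER PRIMARY KEY, name TEXT, units TEXT);
    CREATE TABLE ncfile (id INTEGER PRIMARY KEY, variable_id INTEGER, frequency TEXT);
    INSERT INTO cfvariable VALUES (1, 'temp', 'K'), (2, 'salt', 'psu');
    INSERT INTO ncfile VALUES (1, 1, '1 daily'), (2, 1, '1 daily'), (3, 2, '1 monthly');
    """
)

# GROUP BY all entity columns -- roughly what group_by(CFVariable) expands to.
wide = conn.execute(
    "SELECT v.name, f.frequency, COUNT(*) FROM cfvariable v "
    "JOIN ncfile f ON f.variable_id = v.id "
    "GROUP BY v.id, v.name, v.units, f.frequency ORDER BY v.name"
).fetchall()

# GROUP BY primary key only -- the patched group_by(CFVariable.id).
narrow = conn.execute(
    "SELECT v.name, f.frequency, COUNT(*) FROM cfvariable v "
    "JOIN ncfile f ON f.variable_id = v.id "
    "GROUP BY v.id, f.frequency ORDER BY v.name"
).fetchall()

# Same groups either way; the primary key alone determines the row.
assert wide == narrow
print(narrow)  # [('salt', '1 monthly', 1), ('temp', '1 daily', 2)]
```

The results are identical; the win is purely in how much data the engine has to carry through the grouping step, which compounds over a large database.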

Before the change it was about 30s:

before:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
32.3 s ± 559 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

after:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
17.1 s ± 60.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

I think any latency beyond that is down to serving the database over lustre, which is unavoidable...

aidanheerdegen commented 2 years ago

Nice detective work, and a good idea to take the disk latency out of the equation. I don't know why it used to be so much faster. It was on the OOD, so perhaps the IO was being saturated. I know that it has always been slower on the cloud infrastructure. Now I am intrigued by the option of loading the DB into memory to improve performance.
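On the in-memory idea: the stdlib `sqlite3` module can copy an on-disk database into RAM with the backup API, which would take filesystem latency out of every subsequent query. A minimal sketch (the demo database here is a throwaway; whether `cc.database.create_session` can be pointed at such a connection is a separate question):

```python
import os
import sqlite3
import tempfile

def load_db_into_memory(path):
    """Copy an on-disk SQLite database into an in-memory connection."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)  # page-by-page copy into RAM (Python >= 3.7)
    disk.close()
    return mem

# Demo with a throwaway on-disk database:
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (42)")
    conn.commit()
    conn.close()

    mem = load_db_into_memory(path)
    print(mem.execute("SELECT x FROM t").fetchone())  # (42,)
```

The in-memory copy outlives the on-disk file, so queries against `mem` never touch lustre; the trade-off is the up-front copy time and the RAM footprint of the whole database.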

A couple of quick tests on gadi suggest a fair amount of variability, and still a decently long disk access time:

$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    2m38.213s
user    0m56.782s
sys     0m21.232s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    1m30.493s
user    0m50.484s
sys     0m19.030s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    2m36.051s
user    0m59.151s
sys     0m21.248s

The second run was fast because the database was still in the page cache. The third was after a ~30 min break, so similar to the first.

Sounds like that code change is worth making if it halves the time.