ESGF / esgf-download

ESGF data transfer and replication tool
https://esgf.github.io/esgf-download/
BSD 3-Clause "New" or "Revised" License
15 stars 2 forks source link

Improving database efficiency #44

Open meteorologist15 opened 3 months ago

meteorologist15 commented 3 months ago

My recent usage of esgpull has prompted me to officially raise an issue with regards to how efficiently (or inefficiently in this case) the database is utilized. Our carryover esgpull.db from the Synda days stands at around 2.3 GB in size. Starting this past Monday (2024-06-03), I have been attempting to download a fairly large grouping of data (~2.0 TiB; 60,000+ files), and after adding and tracking the query, the "update" process (esgpull update ) is STILL trudging along, with the database seemingly processing/validating/querying one file approximately every 60-90 seconds. If the pace continues, I'll be able to get through this "update" step in a little over a month from now, a very noticeable scaling issue. I'm not sure exactly how esgpull's "update" command utilizes/accesses the database, but I would presume that whatever algorithm is being used now should eventually be tuned to scale better with bigger requests and an already-packed esgpull.db.

svenrdz commented 1 month ago

Hi @meteorologist15 Apologies for the wait on this issue. I've just spent the day on the problem and found a few ways to improve the performance on updates. The fix currently sits in this PR: https://github.com/ESGF/esgf-download/pull/47 I'll make sure to push a new release as soon as possible, to both PyPI and conda.

meteorologist15 commented 1 month ago

Thank you for pushing this PR! I was wondering though, were there any changes made that could break my old esgpull.db? I just installed a brand new instance of esgpull, copied over my esgpull.db, and when I ran 'esgpull show', I got the following message:


DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: SELECT facet.name AS facet_name, facet.value AS facet_value, facet.sha AS facet_sha FROM facet, selection_facet WHERE ? = selection_facet.selection_sha AND facet.sha = selection_facet.facet_sha]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

I then tried to transfer over a copy of an older esgpull.db than I had made when I had just translated my synda.db over to an esgpull.db. That also failed with a similar message:


DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: SELECT count(?) AS count_1 FROM query_file JOIN file ON file.sha = query_file.file_sha WHERE query_file.query_sha = ?]

(Background on this error at: https://sqlalche.me/e/20/4xp6)

Any thoughts? Thanks!

meteorologist15 commented 1 month ago

Also, I see additional "esgpull.db-shm" and "esgpull.db-wal" files in my 'db' directory, in addition to the clean esgpull.db upon installation. Could you perhaps describe what they are for? Thanks!

meteorologist15 commented 1 month ago

Perhaps the 'LEGACY' parameter is tripping things up? (though it had worked before?)

From the log:

[2024-07-19 18:27:57] DEBUG root Locals: { 'self': <sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect_pysqlite object at 0x2af5617664b0>, 'cursor': <sqlite3.Cursor object at 0x2af561820bc0>, 'statement': 'SELECT count(?) AS count_1 \nFROM query_file JOIN file ON file.sha = query_file.file_sha \nWHERE query_file.query_sha = ?', 'parameters': ('*', 'LEGACY'), 'context': <sqlalchemy.dialects.sqlite.base.SQLiteExecutionContext object at 0x2af5617d5520> }