TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Replace the in-memory glomming code with DuckDB databases #291

gaurav commented 3 weeks ago

Loading the DrugChemicalConflated synonyms into DuckDB has been very successful: DuckDB could load the 6.2G gzipped file (containing 67G uncompressed JSON) in 1019 seconds (17 mins) on our 500G Babel instance on Sterling, which it was able to store it in a 15G DuckDB file or to export them into 6.1G Cliques and 6.1G Synonyms Parquet files. I can then run queries against them in a few minutes. Note that I haven't tried this on the largest file (Protein.txt, which is 8.7G gzipped) yet, but I assume that it will take a similar amount of time.
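
For reference, here is a minimal sketch (not the actual loading code) of what this looks like with DuckDB's Python API; the file names, table name, and output path are illustrative assumptions:

```python
# Hedged sketch: load a gzipped JSON-lines synonym file into DuckDB and export
# it to Parquet. File names and the resulting schema are assumptions here.
import duckdb

con = duckdb.connect("synonyms.duckdb")  # hypothetical on-disk database

# DuckDB reads gzipped newline-delimited JSON directly and infers a schema.
con.execute("""
    CREATE TABLE synonyms AS
    SELECT *
    FROM read_json_auto('DrugChemicalConflated.txt.gz',
                        format = 'newline_delimited')
""")

# Export to Parquet for downstream queries and tests.
con.execute("""
    COPY synonyms TO 'DrugChemicalConflated.synonyms.parquet' (FORMAT PARQUET)
""")

print(con.execute("SELECT count(*) FROM synonyms").fetchone())
```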

We could individually index all the synonym files and use that to create some index-wide tests (#225), but since DuckDB can spill over to disk if it runs out of memory, we could take this one step further and use DuckDB to replace our current in-memory cliquing and glomming process. The idea is that we would create DuckDB databases containing all the information that currently goes into cliques and synonyms, possibly including provenance information on where each edge came from, assuming we can fit all of that into our in-development RDBMS schema (#281). We could then export from those databases into the current compendium/synonym format.
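
As a rough illustration (not the existing glomming code), the core operation can be treated as connected components over equivalence edges and computed by iterative label propagation in SQL. The table and column names below are invented for the example:

```python
# Hedged sketch: compute cliques (connected components over equivalence edges)
# with iterative label propagation in DuckDB. Table/column names are invented
# for illustration and are not Babel's actual schema.
import duckdb

con = duckdb.connect()  # in-memory here; a file path would let DuckDB spill to disk

# Toy equivalence edges; in practice these would be loaded from the source files.
con.execute("CREATE TABLE edges (subject TEXT, object TEXT)")
con.execute("""
    INSERT INTO edges VALUES
        ('CHEBI:1', 'MESH:A'),
        ('MESH:A',  'UNII:X'),   -- these three identifiers form one clique
        ('CHEBI:2', 'MESH:B')    -- these two form another
""")

# Symmetrize the edges, then seed every identifier with itself as its clique label.
con.execute("""
    CREATE TABLE sym AS
    SELECT subject AS a, object AS b FROM edges
    UNION
    SELECT object AS a, subject AS b FROM edges
""")
con.execute("CREATE TABLE labels AS SELECT DISTINCT a AS id, a AS label FROM sym")

# Propagate the smallest label across edges until a fixed point is reached.
while True:
    con.execute("""
        CREATE OR REPLACE TABLE labels_next AS
        SELECT l.id, least(l.label, min(n.label)) AS label
        FROM labels l
        JOIN sym s    ON s.a = l.id
        JOIN labels n ON n.id = s.b
        GROUP BY l.id, l.label
    """)
    changed = con.execute("""
        SELECT count(*) FROM labels l JOIN labels_next x USING (id)
        WHERE l.label <> x.label
    """).fetchone()[0]
    con.execute("DROP TABLE labels")
    con.execute("ALTER TABLE labels_next RENAME TO labels")
    if changed == 0:
        break

# Each clique is the set of identifiers sharing a label.
print(con.execute("SELECT label, list(id) FROM labels GROUP BY label").fetchall())
```

This naive propagation needs one pass per "hop" within a clique and ignores singleton identifiers and the type- and prefix-specific rules the real glomming code applies; it is only meant to show that the core operation maps onto set-based SQL that DuckDB can spill to disk.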

In the short term, this would allow us to implement flexible index-wide tests (#225), including tackling primary identifiers that appear in multiple types (e.g. #276), and it might allow people to run Babel even on very large compendia without the 500G node we currently need at RENCI. This has been requested by @jeffhhk (#143); implementing it with DuckDB is probably a more sustainable approach than the optimized algorithm described in #143, and it gives us some additional benefits.
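
As an example of the kind of index-wide test this would enable (again a sketch, with an assumed Parquet layout and assumed column names), checking whether any identifier is claimed by cliques of more than one type becomes a single query over the exported files:

```python
# Hedged sketch of an index-wide test: find identifiers that appear in cliques
# of more than one type. The Parquet glob and column names are assumptions.
import duckdb

con = duckdb.connect()
offenders = con.execute("""
    SELECT identifier, list(DISTINCT type) AS types
    FROM read_parquet('cliques/*.parquet')
    GROUP BY identifier
    HAVING count(DISTINCT type) > 1
""").fetchall()
print(f"{len(offenders)} identifiers appear under more than one type")
```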

In the medium term, this could be a pathway for providing provenance information across all cliques and synonyms (#205), which we could do by including provenance in the exported Parquet files. Those files might be easier to work with than the compendium/synonym files (although they may turn out to be just as difficult to deal with).
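
To make the provenance idea concrete, a lookup like "which source asserted the edges touching this identifier" would be a small query over such a file; the file name and the subject/object/source columns below are illustrative assumptions, not a schema Babel currently produces:

```python
# Hedged sketch: look up provenance for edges touching one identifier. The file
# name and the subject/object/source columns are assumptions for illustration.
import duckdb

con = duckdb.connect()
rows = con.execute("""
    SELECT subject, object, source
    FROM read_parquet('DrugChemicalConflated.edges.parquet')
    WHERE subject = 'CHEBI:15365' OR object = 'CHEBI:15365'  -- acetylsalicylic acid
""").fetchall()
for subject, obj, source in rows:
    print(subject, obj, source)
```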

In the long term, this could help us transition from our current JSON/Redis-based deployment system to a Parquet/RDBMS/Redis deployment system, which might reduce hosting costs and result in a more streamlined, easier-to-deploy NodeNorm.

The main reasons not to do this (yet) are:

  1. The glomming code, although centralized, is somewhat complicated, and we'll want to make sure we're handling all the edge cases properly.
  2. One of the most useful optimizations we have going on is splitting up the output files by type -- this means that we don't have to build Protein/Gene/SmallMolecule if we're working on another compendium. If we combine everything into a single database, it'll be harder to split processing out in that way.