TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Develop an RDBMS schema for Babel outputs #281

Open gaurav opened 1 month ago

gaurav commented 1 month ago

We've standardized on a JSONL output format for Babel outputs (including compendia and conflation files for NodeNorm, synonym files for NameRes, and KGX and SSSOM for other outputs). We also have a particular database schema used by NodeNorm's Redis databases that is specifically designed to be as optimized for identifier lookups as possible.

However, there are three sets of applications where an RDBMS-based view of Babel would be useful:

  1. Index-wide tests (#225), where we e.g. look for multiple cliques that share the same preferred name to check whether we are failing to clique identifiers properly. This is particularly important right now, since we need to look into reducing the number of cliques in chemical entries for Sapbert and into identifying and improving the substandard preferred chemical names currently in the system.
  2. One constraint on people outside of RENCI developing on Babel is the difficulty of setting up a 500G memory system that can hold the full set of protein (and, soon, chemical) identifier cliques in memory at once (#143). If we were to use an RDBMS for this -- i.e. we load the identifier pairs into an RDBMS, then run a series of SQL queries to pull out one clique at a time under a particular constraint -- we might be able to get around this limit.
  3. We could use this RDBMS to store information that we can't fit into NodeNorm or NameRes, such as provenance (#205). To be fair, we could also include this information in the intermediate files we currently generate as part of NodeNorm, but putting it into a standard database format might make it more accessible to users of Translator.
    • The part of this that would be most useful in the short term is that the synonym format we currently use ignores provenance, so working backwards to understand where a particular synonym in NameRes came from is pretty hard right now.
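
To make the three applications above concrete, here is a minimal sketch using Python's built-in `sqlite3`. Everything in it is an assumption for illustration: the table and column names (`clique`, `pair`, `synonym`, `source`), the `EX:` identifiers, and the toy rows are all hypothetical, and this is not a proposed schema for Babel. It only shows that each application maps onto a fairly small SQL query: a `GROUP BY` for duplicate preferred names, a recursive query that walks identifier pairs to pull out a single clique without holding everything in memory, and a provenance lookup for a synonym.

```python
import sqlite3

# Hypothetical tables for illustration only; not Babel's actual schema.
SCHEMA = """
CREATE TABLE clique (
    clique_id      INTEGER PRIMARY KEY,
    preferred_name TEXT
);
CREATE TABLE pair (          -- one row per "a is equivalent to b" assertion
    a TEXT NOT NULL,
    b TEXT NOT NULL
);
CREATE TABLE synonym (
    clique_id INTEGER NOT NULL REFERENCES clique(clique_id),
    label     TEXT NOT NULL,
    source    TEXT NOT NULL  -- provenance: where this label came from
);
CREATE INDEX idx_pair_a ON pair(a);
CREATE INDEX idx_pair_b ON pair(b);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO clique VALUES (?, ?)",
                     [(1, "water"), (2, "water"), (3, "aspirin")])
    conn.executemany("INSERT INTO pair VALUES (?, ?)",
                     [("EX:1", "EX:2"), ("EX:2", "EX:3"), ("EX:4", "EX:5")])
    conn.executemany("INSERT INTO synonym VALUES (?, ?, ?)",
                     [(1, "H2O", "example-source-a")])
    return conn


def duplicate_preferred_names(conn):
    """Application 1: preferred names shared by more than one clique."""
    return conn.execute("""
        SELECT preferred_name, COUNT(*) AS n
        FROM clique
        GROUP BY preferred_name
        HAVING COUNT(*) > 1
        ORDER BY n DESC, preferred_name
    """).fetchall()


def clique_of(conn, curie):
    """Application 2: walk the pairs outward from one identifier.

    UNION (not UNION ALL) drops identifiers already seen, so the
    recursion stops once the whole connected component is collected.
    """
    rows = conn.execute("""
        WITH RECURSIVE member(curie) AS (
            SELECT ?
            UNION
            SELECT CASE WHEN p.a = m.curie THEN p.b ELSE p.a END
            FROM pair AS p
            JOIN member AS m ON m.curie IN (p.a, p.b)
        )
        SELECT curie FROM member ORDER BY curie
    """, (curie,))
    return [row[0] for row in rows]


def synonym_sources(conn, label):
    """Application 3: which clique and source a synonym came from."""
    return conn.execute(
        "SELECT clique_id, source FROM synonym WHERE label = ? "
        "ORDER BY clique_id, source",
        (label,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(duplicate_preferred_names(conn))  # [('water', 2)]
    print(clique_of(conn, "EX:3"))          # ['EX:1', 'EX:2', 'EX:3']
    print(synonym_sources(conn, "H2O"))     # [(1, 'example-source-a')]
```

Whether a recursive query like this performs acceptably on the full protein or chemical pair sets is an open question; a server-based RDBMS with proper indexes, or precomputed clique IDs, might be needed at that scale.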

All of this would benefit from sketching out an RDBMS schema that supports all of these applications and that might someday be used to store the cliques before they are exported in all the Babel output formats.

gaurav commented 1 month ago

Here is an initial schema: