TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Implement very memory-efficient clique building #143

Open · gaurav opened this issue 1 year ago

gaurav commented 1 year ago

@jeffhhk has a very memory-efficient divide-and-conquer clique building algorithm in his head. If we incorporate it into Babel, we might not need 500G of memory to run it, which would be nice.

gaurav commented 1 year ago

Since Babel intermediate files are in TSV format, @balhoff suggests that Souffle might be a good way to do this. This might also make it easy to add more Boomer-like rules if we want to do that in the future.

@jeffhhk used Chez Scheme (https://cisco.github.io/ChezScheme/) and ChatGPT to come up with the promised divide-and-conquer clique-building algorithm. It works roughly like this (a Python sketch follows the list):

  1. Allocate a fixed-size chunk of memory for clique building. Its size can be adjusted to whatever your machine supports.
  2. Set up the memory such that each index is set to itself (x[0] = 0, x[1] = 1, etc.)
  3. Do a coarse hash of every pair of identifiers, by:
    1. Use a hash function (Python has one included) modulo the size of the memory chunk to convert each string into an index.
    2. Implement an algorithm that modifies the memory chunk so that it maintains cliques of hash slots, e.g. to add the pair (4, 1) to the clique 4->2->3->5, rewire it to 4->1->2->3... -- there are four possible cases to consider here.
  4. Write every pair of identifiers into a file, but add an additional column that identifies the clique. We can identify a clique by arbitrarily picking the smallest index in that clique.
  5. Use /usr/bin/sort to sort this file by its first column. This produces a file of "supercliques", each identified by a clique ID in its first column; because the hashing is coarse, most if not all of them will need to be split. After sorting, the file consists of one superclique after another.
  6. Read the sorted file, but this time use a fine hashing algorithm -- I think that means skipping the modulo, or maybe just using a normal hashtable, so identifiers are distinguished exactly. This takes a lot more memory per identifier, BUT every true clique is contained within the current superclique, so we can use as much memory as a single superclique needs, then throw the hashtable away and move on to the next one.
  7. This will give you a series of cliques with a set of identifiers for each.
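
Here is a minimal Python sketch of how steps 1 through 5 might fit together, purely for illustration. The slot count, the choice of `crc32` as the hash, and the file names are placeholders; I'm also assuming the ring-splice reading of step 3.2 (each slot points to the next member of its coarse clique) and a second pass over the pairs for step 4, since later merges can change a clique's smallest slot:

```python
import zlib

SLOTS = 1 << 20                    # step 1: fixed memory budget, tune to the machine
ring = list(range(SLOTS))          # step 2: x[i] = i, every slot is its own clique

def slot(identifier):
    """Step 3.1: coarse hash -- a stable hash taken modulo the slot count."""
    return zlib.crc32(identifier.encode("utf-8")) % SLOTS

def same_clique(a, b):
    """Cliques are circular linked lists over slots; walk the ring from a looking for b."""
    if a == b:
        return True
    i = ring[a]
    while i != a:
        if i == b:
            return True
        i = ring[i]
    return False

def merge(a, b):
    """Step 3.2: splice the two rings into one, unless a and b already share a ring
    (splicing a ring with itself would split it)."""
    if not same_clique(a, b):
        ring[a], ring[b] = ring[b], ring[a]

def clique_label(a):
    """Step 4: label a clique by the smallest slot on its ring (cache this in practice)."""
    smallest, i = a, ring[a]
    while i != a:
        smallest = min(smallest, i)
        i = ring[i]
    return smallest

# Pass 1: build the coarse partition from a TSV of identifier pairs.
with open("pairs.tsv") as pairs:
    for line in pairs:
        left, right = line.rstrip("\n").split("\t")
        merge(slot(left), slot(right))

# Pass 2 (step 4): re-read the pairs and prepend the coarse clique label, so that
# `sort -t$'\t' -k1,1n coarse.tsv` (step 5) groups each superclique contiguously.
with open("pairs.tsv") as pairs, open("coarse.tsv", "w") as out:
    for line in pairs:
        left, right = line.rstrip("\n").split("\t")
        out.write(f"{clique_label(slot(left))}\t{left}\t{right}\n")
```

After sorting by the first column, the fine pass (steps 6 and 7) could then resolve one superclique at a time over the actual identifier strings, something like:

```python
import itertools
from collections import defaultdict, deque

def exact_cliques(pairs):
    """Connected components over the real identifier strings (no modulo), using an
    ordinary in-memory structure that only lives for one superclique."""
    adjacency = defaultdict(set)
    for left, right in pairs:
        adjacency[left].add(right)
        adjacency[right].add(left)
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        yield component

with open("coarse.sorted.tsv") as sorted_pairs:
    rows = (line.rstrip("\n").split("\t") for line in sorted_pairs)
    for label, group in itertools.groupby(rows, key=lambda row: row[0]):
        for clique in exact_cliques((left, right) for _, left, right in group):
            print("\t".join(clique))   # step 7: one exact clique per line
```

(The "cliques" here are really the connected components of the equivalence edges, which is how the term is used throughout this thread.)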
jeffhhk commented 1 year ago

Well, I wrote some code in Chez Scheme and translated it to Python with the help of ChatGPT.

Gaurav informed me that the Babel build takes place on a 500GB server primarily because of clique building.

Here is a demo of building cliques over 200M edges in 1 GB of memory, in Python:

https://gist.github.com/jeffhhk/42bc2ea4c263eb4958220522201975ae

Currently, the fine cliques are printed on stdout, which is more useful for debugging than for running whole files through it.

The demo prints each clique as a list of nodes, but it sounds like the Node Normalizer has some policy that needs additional context to make its decisions. @cbizon told me verbally something to the effect that those policy decisions are currently scattered throughout clique building.

It seems to me that it should be possible to (A) decorate each edge with metadata on the way in, (B) run a policy-free algorithm like the one demonstrated, and (C) emit the edges and their metadata belonging to each clique. On that basis, it seems reasonable to conjecture that whatever policy decisions are needed can be ported to postprocessing of that output, after the RAM-intensive computation has completed; a rough sketch of that split follows below.
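
To make that concrete, here is a tiny hedged sketch of the postprocessing side, assuming the policy-free cliquer wrote one row per edge as `clique_label <TAB> subject <TAB> object <TAB> metadata...`, sorted by label. `toy_policy` is a made-up placeholder, not anything the Node Normalizer actually does:

```python
import itertools

def toy_policy(identifiers):
    # Placeholder policy: pick the lexicographically smallest CURIE as preferred.
    # The real Node Normalizer rules would go here, using the edge metadata.
    return min(identifiers)

with open("cliqued_edges.tsv") as edges:
    rows = (line.rstrip("\n").split("\t") for line in edges)
    for label, group in itertools.groupby(rows, key=lambda row: row[0]):
        group = list(group)
        members = {row[1] for row in group} | {row[2] for row in group}
        print(label, toy_policy(members), len(group), sep="\t")
```

The point is only that the policy code runs over small, already-grouped cliques, after the RAM-intensive pass is finished.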

I hope this kind of change can make it easier to run a Babel build and to contribute improvements that enrich its information.

jeffhhk commented 10 months ago

Changing the coarse (first) pass of this algorithm to be based on union-find would take the same space and be faster. For your data size, I would guess about 4x faster. Union-find would also benefit the fine pass but not as much.
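
For reference, a minimal sketch of that union-find variant of the coarse pass, using the usual single-array encoding (negative entries mark roots and store minus the set size), so it keeps the same one-integer-per-slot footprint. The slot count and hash choice are placeholders, as before:

```python
import zlib

SLOTS = 1 << 20
uf = [-1] * SLOTS                      # every slot starts as a root of a size-1 set

def slot(identifier):
    return zlib.crc32(identifier.encode("utf-8")) % SLOTS

def find(i):
    """Follow parent pointers to the root, halving the path along the way,
    so repeated finds are effectively constant time."""
    while uf[i] >= 0:
        if uf[uf[i]] >= 0:
            uf[i] = uf[uf[i]]          # path halving: point to the grandparent
        i = uf[i]
    return i

def union(a, b):
    """Merge the sets containing slots a and b, attaching the smaller tree
    under the larger one (union by size)."""
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if uf[ra] > uf[rb]:                # uf[root] == -size, so "greater" means smaller set
        ra, rb = rb, ra
    uf[ra] += uf[rb]
    uf[rb] = ra

# find(slot(x)) replaces both the same-clique walk and the clique-label walk in the
# ring version; the root index serves as the coarse clique label. Example input
# (two identifiers for water):
union(slot("MESH:D014867"), slot("CHEBI:15377"))
assert find(slot("MESH:D014867")) == find(slot("CHEBI:15377"))
```

Path halving plus union by size makes each merge effectively constant time, whereas walking a ring is linear in the clique size.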