TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Improve prefix checking #359

Open gaurav opened 1 month ago

gaurav commented 1 month ago

We currently generate a prefix-based report that describes the prefix composition of each clique. For example, the final rows in reports/Gene.txt are:

frozenset({('RGD', 1), ('NCBIGENE', 1)})    22410
frozenset({('RGD', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})    24805
frozenset({('MGI', 1), ('NCBIGENE', 1)})    25522
frozenset({('NCBIGENE', 1), ('WORMBASE', 1)})   28785
frozenset({('MGI', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})    32732
frozenset({('ENSEMBL', 1)}) 3364683
frozenset({('ENSEMBL', 1), ('NCBIGENE', 1)})    11464913
frozenset({('NCBIGENE', 1)})    44080138

Unfortunately, this isn't easy to compare between different runs, and it doesn't directly tell us, for example, how many NCBIGene identifiers we have in total or whether we ever have a clique with multiple NCBIGene identifiers.
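As a rough illustration of the totals the current report doesn't show directly, here is a minimal sketch that derives them from rows in the format above. It assumes that the integer in each `(prefix, n)` tuple is the number of identifiers with that prefix in the clique, and that the trailing number is the count of cliques with that composition. The function names are hypothetical and are not part of Babel.

```python
import ast
from collections import Counter


def parse_report_line(line):
    """Parse one row such as "frozenset({('RGD', 1), ('NCBIGENE', 1)})    22410".

    Returns ({prefix: identifiers_per_clique}, clique_count).
    """
    # The clique count is the last whitespace-separated field on the row.
    composition_text, count_text = line.rsplit(None, 1)
    composition_text = composition_text.strip()
    if not (composition_text.startswith("frozenset(") and composition_text.endswith(")")):
        raise ValueError(f"Unexpected row format: {line!r}")
    # Strip the frozenset(...) wrapper and safely evaluate the inner set literal.
    pairs = ast.literal_eval(composition_text[len("frozenset("):-1])
    return dict(pairs), int(count_text)


def summarize_prefixes(lines):
    """Return (total identifiers per prefix, cliques with >1 identifier per prefix)."""
    total_identifiers = Counter()
    cliques_with_multiple = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank rows
        composition, clique_count = parse_report_line(line)
        for prefix, per_clique in composition.items():
            # Assumption: per_clique identifiers in each of clique_count cliques.
            total_identifiers[prefix] += per_clique * clique_count
            if per_clique > 1:
                cliques_with_multiple[prefix] += clique_count
    return total_identifiers, cliques_with_multiple
```

Passing the open `reports/Gene.txt` file object to `summarize_prefixes` would give per-prefix totals that are easier to compare between runs than the composition rows themselves.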

This issue proposes another way of getting this information:

We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
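A comparison script along these lines might look like the sketch below. The file format is an assumption: it treats each run's report as a flat JSON object mapping a prefix to a count, which may not match the real structure of the report, and `compare_prefix_counts` is a hypothetical name.

```python
import json


def load_prefix_counts(path):
    """Load a report assumed to be a flat JSON object of {prefix: count}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def compare_prefix_counts(old_counts, new_counts):
    """Return {prefix: (old, new, delta)} for every prefix whose count changed.

    A prefix missing from one run is treated as having a count of 0 there.
    """
    changes = {}
    for prefix in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(prefix, 0)
        new = new_counts.get(prefix, 0)
        if old != new:
            changes[prefix] = (old, new, new - old)
    return changes
```

For example, `compare_prefix_counts(load_prefix_counts(old_path), load_prefix_counts(new_path))` would list only the prefixes that changed between two runs, including prefixes that appeared or disappeared.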

gaurav commented 1 month ago

Here's where we're at as of 664f0d2dd5b9b8f2211351348e98e963e355c195 (in PR #363): prefix_report.json

The clique count doesn't line up with Babel 1.8, although it is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080, a difference of 12,318 cliques, which seems suspiciously small. There might be some kind of bug in how this is generated. I will continue to poke.

gaurav commented 1 month ago

Fixed some bugs, and here's where we're at as of ba62bd9cc894ae7c315cf44d11ff99d96d91b61a (in PR #363): prefix_report.json. We still have a clique count of 477,004,080, but now the CURIE count is close to the right answer too: NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253.

gaurav commented 1 month ago