Improve prefix checking

gaurav commented 1 month ago

We currently generate a prefix-based report that describes the prefix composition of each clique. For example, the final rows in reports/Gene.txt are:

frozenset({('RGD', 1), ('NCBIGENE', 1)})    22410
frozenset({('RGD', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})    24805
frozenset({('MGI', 1), ('NCBIGENE', 1)})    25522
frozenset({('NCBIGENE', 1), ('WORMBASE', 1)})   28785
frozenset({('MGI', 1), ('ENSEMBL', 1), ('NCBIGENE', 1)})    32732
frozenset({('ENSEMBL', 1)}) 3364683
frozenset({('ENSEMBL', 1), ('NCBIGENE', 1)})    11464913
frozenset({('NCBIGENE', 1)})    44080138

Unfortunately, this isn't easy to compare between different runs, and it doesn't really tell us e.g. how many NCBIGene identifiers we have in total or whether we ever have a clique with multiple NCBIGene identifiers.

This issue proposes another way of getting this information:

We create a JSON file containing a dictionary with an entry for every prefix.
For every prefix, we determine:
- The total number of CURIEs in the system with that prefix (total_curies)
- The total number of unique CURIEs in the system with that prefix (total_unique_curies -- if not identical to total_curies, it means some duplication is going on)
- The total number of cliques containing this prefix (total_cliques_containing_prefix -- if equal to total_unique_curies, then every clique that contains this prefix has exactly one identifier with it)
- The files this prefix is present in
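
To make the proposed fields concrete, here is a minimal sketch of how they could be computed. It is not the code in the PR: the input shape (a dict mapping a file name to a list of cliques, each clique a list of CURIE strings), the function name `build_prefix_report` and the `files` key are all assumptions made for illustration.

```python
import json
from collections import defaultdict


def build_prefix_report(cliques_by_file):
    """Summarise cliques per prefix.

    cliques_by_file is a hypothetical input shape: a dict mapping a file
    name to a list of cliques, where each clique is a list of CURIE strings.
    """
    total_curies = defaultdict(int)
    unique_curies = defaultdict(set)
    cliques_with_prefix = defaultdict(int)
    files = defaultdict(set)

    for filename, cliques in cliques_by_file.items():
        for clique in cliques:
            prefixes_in_clique = set()
            for curie in clique:
                # The prefix is everything before the first colon.
                prefix = curie.split(":", 1)[0]
                total_curies[prefix] += 1
                unique_curies[prefix].add(curie)
                files[prefix].add(filename)
                prefixes_in_clique.add(prefix)
            # Count each clique once per prefix, however many
            # identifiers with that prefix it contains.
            for prefix in prefixes_in_clique:
                cliques_with_prefix[prefix] += 1

    return {
        prefix: {
            "total_curies": total_curies[prefix],
            "total_unique_curies": len(unique_curies[prefix]),
            "total_cliques_containing_prefix": cliques_with_prefix[prefix],
            "files": sorted(files[prefix]),  # "files" is an assumed key name
        }
        for prefix in sorted(total_curies)
    }


# Made-up example data, not taken from a real Babel run.
example = {
    "Gene.txt": [
        ["NCBIGene:1", "ENSEMBL:ENSG1"],
        ["NCBIGene:2"],
        ["NCBIGene:2", "MGI:5"],
    ],
}
report = build_prefix_report(example)
print(json.dumps(report, indent=2))
```

In this made-up example NCBIGene:2 appears in two cliques, so total_curies (3) differs from total_unique_curies (2), which is the duplication signal described above. Holding every unique CURIE in a set is fine for a sketch but would likely need a more memory-friendly approach at the scale of a full run.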

We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
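
A comparison script along these lines might look like the sketch below. It assumes each report is a dictionary keyed by prefix with the three count fields named above; the function name `compare_prefix_reports` and the example numbers are hypothetical, and in practice the two dictionaries would be loaded with `json.load` from the two runs' report files.

```python
COUNT_FIELDS = (
    "total_curies",
    "total_unique_curies",
    "total_cliques_containing_prefix",
)


def compare_prefix_reports(old, new):
    """Return per-prefix deltas (new minus old) for every count that changed.

    A prefix missing from one report is treated as having counts of zero,
    so added and removed prefixes show up as positive or negative deltas.
    """
    changes = {}
    for prefix in sorted(set(old) | set(new)):
        before = old.get(prefix, {})
        after = new.get(prefix, {})
        deltas = {}
        for field in COUNT_FIELDS:
            delta = after.get(field, 0) - before.get(field, 0)
            if delta != 0:
                deltas[field] = delta
        if deltas:
            changes[prefix] = deltas
    return changes


# Made-up numbers for illustration only.
old_report = {
    "NCBIGENE": {"total_curies": 10, "total_unique_curies": 10,
                 "total_cliques_containing_prefix": 10},
    "MGI": {"total_curies": 5, "total_unique_curies": 5,
            "total_cliques_containing_prefix": 5},
}
new_report = {
    "NCBIGENE": {"total_curies": 12, "total_unique_curies": 11,
                 "total_cliques_containing_prefix": 10},
    "RGD": {"total_curies": 3, "total_unique_curies": 3,
            "total_cliques_containing_prefix": 3},
}
changes = compare_prefix_reports(old_report, new_report)
for prefix, deltas in changes.items():
    print(prefix, deltas)
```

Reporting only the non-zero deltas keeps the output short when most prefixes are unchanged between runs.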

gaurav commented 1 month ago

Here's where we're at as of 664f0d2dd5b9b8f2211351348e98e963e355c195 (in PR #363): prefix_report.json

The clique count doesn't line up with Babel 1.8 (it is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080 -- a difference of 12,318 cliques, which seems suspiciously small), so there might be some kind of bug in how this is generated. I will continue to poke.

gaurav commented 1 month ago

Fixed some bugs. Here's where we're at as of ba62bd9cc894ae7c315cf44d11ff99d96d91b61a (in PR #363): prefix_report.json. We still have a clique count of 477,004,080, but now the CURIE count is close to the right answer too (NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253).

gaurav commented 1 month ago

[ ] Ooo, for every for_clique/by_file entry, we should count how often we get the LEADER and how often we get a SECONDARY ID with the same prefix. Tricky, but should be doable.
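
One possible way to do that counting is sketched below. It assumes the leader is the first identifier in each clique and that everything after it is a secondary ID; that convention, the input shape and the function name `count_leaders_and_secondaries` are assumptions for illustration, not something this thread confirms.

```python
from collections import Counter


def count_leaders_and_secondaries(cliques):
    """Count, per prefix, how often it appears as leader vs. secondary ID.

    Assumes (hypothetically) that each clique is a list of CURIE strings
    whose first element is the leader.
    """
    leaders = Counter()
    secondaries = Counter()
    for clique in cliques:
        if not clique:
            continue  # skip empty cliques rather than fail on them
        leader, *rest = clique
        leaders[leader.split(":", 1)[0]] += 1
        for curie in rest:
            secondaries[curie.split(":", 1)[0]] += 1
    return leaders, secondaries


# Made-up cliques for illustration only.
example_cliques = [
    ["NCBIGene:1", "ENSEMBL:ENSG1"],
    ["NCBIGene:2", "NCBIGene:3", "MGI:5"],
    ["ENSEMBL:ENSG2"],
]
leaders, secondaries = count_leaders_and_secondaries(example_cliques)
print(dict(leaders))
print(dict(secondaries))
```

A prefix that shows up in both counters (NCBIGene in this example) is one that sometimes leads a clique and sometimes rides along as a secondary ID, which is the case the checklist item is after.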

TranslatorSRI / Babel

Improve prefix checking #359