Open gaurav opened 1 month ago
Here's where we're at as of 664f0d2dd5b9b8f2211351348e98e963e355c195 (in PR #363): prefix_report.json
The clique count doesn't line up with Babel 1.8. It is VERY close: NodeNorm Dev has 476,991,762 cliques while we report 477,004,080, a difference of 12,318 cliques. That seems suspiciously small, so there might be some kind of bug in how this was generated. I will continue to poke.
Fixed some bugs and here's where we're at as of ba62bd9cc894ae7c315cf44d11ff99d96d91b61a (in PR #363): prefix_report.json; we still have a clique count of 477,004,080 but now the CURIE count is close to the right answer too (NodeNorm Dev has 664,316,676 CURIEs while we report 664,529,929, a difference of 213,253).
For each for_clique/by_file entry, we should count how often we get the LEADER and how often we get a SECONDARY ID with the same prefix. Tricky, but should be doable.
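
A minimal sketch of that counting, assuming each clique is available as a list of CURIEs with the leader first. The function name, the input shape and the example identifiers are hypothetical, not Babel's actual data structures. It reads "SECONDARY ID with the same prefix" as "any non-leader identifier carrying that prefix".

```python
from collections import Counter


def count_leader_and_secondary(cliques):
    """Count, per prefix, how often it appears as a leader and as a secondary ID.

    Assumption: each clique is a list of CURIEs and the first one is the leader.
    """
    leader_counts = Counter()
    secondary_counts = Counter()
    for clique in cliques:
        if not clique:
            continue  # skip empty cliques rather than fail
        # The prefix is everything before the first colon of the CURIE.
        leader_counts[clique[0].split(":", 1)[0]] += 1
        for curie in clique[1:]:
            secondary_counts[curie.split(":", 1)[0]] += 1
    return leader_counts, secondary_counts


# Made-up example cliques, for illustration only.
example_cliques = [
    ["NCBIGene:1", "ENSEMBL:ENSG1", "HGNC:5"],
    ["NCBIGene:2", "NCBIGene:3"],
]
leaders, secondaries = count_leader_and_secondary(example_cliques)
```

Here `leaders["NCBIGene"]` counts cliques led by an NCBIGene identifier, and `secondaries["NCBIGene"]` counts NCBIGene identifiers in non-leader positions.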
We currently generate a prefix-based report that describes the prefix-based composition of each clique -- for example, the final rows in reports/Gene.txt are:

Unfortunately, this isn't easy to compare between different runs, and doesn't really tell us e.g. how many NCBIGene identifiers we have in total or whether we ever have a clique with multiple NCBIGene identifiers.
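
As a rough illustration of the two numbers the current report doesn't give us (total identifiers per prefix, and how many cliques contain more than one identifier with the same prefix), here is a sketch. It assumes cliques can be iterated as lists of CURIEs; `summarize_prefixes` and the example identifiers are hypothetical.

```python
from collections import Counter


def summarize_prefixes(cliques):
    """Return (total IDs per prefix, number of cliques with >1 ID of a prefix)."""
    total_ids = Counter()
    cliques_with_repeats = Counter()
    for clique in cliques:
        # Count prefixes within this one clique.
        per_clique = Counter(curie.split(":", 1)[0] for curie in clique)
        total_ids.update(per_clique)
        for prefix, n in per_clique.items():
            if n > 1:
                cliques_with_repeats[prefix] += 1
    return total_ids, cliques_with_repeats


# Made-up example cliques, for illustration only.
totals, repeats = summarize_prefixes([
    ["NCBIGene:1", "ENSEMBL:ENSG1", "HGNC:5"],
    ["NCBIGene:2", "NCBIGene:3"],
])
```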
This issue proposes another way of getting this information:
We should then be able to come up with a script that can compare this file between two runs and let us know how things are changing.
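
One possible shape for such a comparison script, as a sketch only. It assumes the report is a JSON file of nested objects whose leaves are numeric counts. The actual layout of prefix_report.json may differ, and the key names in the example are invented. Lists and non-numeric leaves are ignored.

```python
import json


def flatten_counts(node, path=()):
    """Flatten nested dicts into {"a/b/c": number}; other leaf types are skipped."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten_counts(value, path + (str(key),)))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        flat["/".join(path)] = node
    return flat


def diff_reports(old, new):
    """Return {path: (before, after, delta)} for every count that changed."""
    old_flat, new_flat = flatten_counts(old), flatten_counts(new)
    changes = {}
    for key in sorted(old_flat.keys() | new_flat.keys()):
        # A count missing from one run is treated as zero.
        before, after = old_flat.get(key, 0), new_flat.get(key, 0)
        if before != after:
            changes[key] = (before, after, after - before)
    return changes


def diff_report_files(old_path, new_path):
    """Load two report files and diff them."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return diff_reports(json.load(f_old), json.load(f_new))


# Invented report contents, for illustration only.
old_report = {"count_cliques": 10, "by_prefix": {"NCBIGene": 5, "HGNC": 2}}
new_report = {"count_cliques": 12, "by_prefix": {"NCBIGene": 5, "HGNC": 3, "OMIM": 1}}
changes = diff_reports(old_report, new_report)
```

Treating a missing count as zero means a prefix that appears or disappears between runs shows up as a change instead of being silently dropped.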