TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

What proportion of NodeNorm concepts have information content values? #199

Open gaurav opened 1 year ago

gaurav commented 1 year ago

Asked by @MarkDWilliams at the Translator June Relay.

gaurav commented 1 year ago

I'm not sure if there's a good way to run this within Babel, but it can be calculated from the Babel outputs by running:

$ srun --mem=100G jq -r '[.type, .identifiers[0].i, .ic] | @tsv' *.txt > ic-values-all.tsv
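
For anyone who wants to turn that TSV into an actual proportion, here is a minimal Python sketch of the tallying step. It assumes the three columns emitted by the jq command above (type, first identifier, ic) and that jq's `@tsv` writes a missing `.ic` as an empty field. The function name `ic_coverage` and the sample rows are made up for illustration and are not part of Babel.

```python
import io
from collections import Counter


def ic_coverage(lines):
    """Count cliques with and without an information content value, per type.

    Each line is expected to hold three tab-separated fields:
    type, first identifier, ic (empty when the clique has no value).
    """
    total = Counter()
    with_ic = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        clique_type, _identifier, ic = fields[0], fields[1], fields[2]
        total[clique_type] += 1
        if ic.strip():
            with_ic[clique_type] += 1
    return total, with_ic


# Made-up rows standing in for ic-values-all.tsv (identifiers and IC are invented).
sample = io.StringIO(
    "biolink:Disease\tEXAMPLE:1\t42.5\n"
    "biolink:Gene\tEXAMPLE:2\t\n"
    "biolink:Gene\tEXAMPLE:3\t\n"
)
total, with_ic = ic_coverage(sample)
n_total = sum(total.values())
n_with = sum(with_ic.values())
print(f"{n_with} of {n_total} cliques have an IC value ({n_with / n_total:.2%})")
```

For the real file, pass `open("ic-values-all.tsv")` instead of the sample. Reading line by line keeps memory use flat, which matters for a file with hundreds of millions of rows.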

gaurav commented 1 year ago

Out of 426,504,187 cliques, 423,470,395 don't have information content values, which leaves 3,033,792 cliques (about 0.71% of the total) that do. This means that we use most of the 3,600,656 identifiers that we downloaded from UberGraph on May 14, 2023.

Here is a breakdown of the number of cliques, in case you want to look at the distribution of information content values in Babel. Note that the second column is NOT sorted, even though it looks sorted at first glance. Attached: ic-values-all-ic-sorted-uniq.txt
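
If a numerically sorted distribution would be more useful, something like the following Python sketch could rebuild it from the TSV. It assumes the same three-column layout as the jq output (type, first identifier, ic) and that IC values parse as floats. The thread does not say how the attached file was produced, so this is not a reconstruction of that step, and `ic_distribution` and the sample rows are hypothetical.

```python
from collections import Counter


def ic_distribution(lines):
    """Return (ic_value, clique_count) pairs sorted numerically by IC value."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].strip():
            continue  # clique without an IC value, or malformed line
        counts[float(fields[2])] += 1
    # Sorting the float keys gives a numeric order, unlike a plain text sort.
    return sorted(counts.items())


# Made-up rows (identifiers and IC values are invented for illustration).
sample_rows = [
    "biolink:Disease\tEXAMPLE:1\t100.0\n",
    "biolink:Disease\tEXAMPLE:2\t20.5\n",
    "biolink:Gene\tEXAMPLE:3\t20.5\n",
    "biolink:Gene\tEXAMPLE:4\t\n",
]
dist = ic_distribution(sample_rows)
for ic_value, clique_count in dist:
    print(f"{ic_value}\t{clique_count}")
```

Converting each value to a float before sorting means that, for example, 20.5 is listed before 100.0, whereas a text sort would put "100.0" first.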

gaurav commented 1 year ago

I'd like to add this to a Babel report that gets generated regularly, but that's a low-priority task. If there are other high-priority tasks here, please open a new ticket for those.