Open gaurav opened 1 year ago
I'm not sure if there's a good way to run this within Babel, but it can be calculated from the Babel outputs by running:
$ srun --mem=100G jq -r '[.type, .identifiers[0].i, .ic] | @tsv' *.txt > ic-values-all.tsv
Out of 426,504,187 cliques, 423,470,395 don't have information content values, so we have information content values for the remaining 3,033,792 cliques. This means that we use most of the 3,600,656 identifiers that we downloaded from UberGraph on May 14, 2023 (roughly 84%, assuming each of these cliques corresponds to exactly one downloaded identifier).
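For anyone who wants to re-derive these counts from the TSV, here is a minimal Python sketch. It assumes the three-column layout produced by the jq command above (type, first identifier, information content) and that jq's `@tsv` writes a missing `.ic` as an empty third field. The function name `summarize_ic` and the sample rows are made up for illustration and are not part of Babel.

```python
import csv
import io


def summarize_ic(tsv_lines):
    """Return (total, with_ic, without_ic) for rows of: type, identifier, ic.

    Assumes a missing information content value shows up as an empty
    third column, which is how jq's @tsv renders null.
    """
    total = 0
    with_ic = 0
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        total += 1
        if len(row) >= 3 and row[2].strip():
            with_ic += 1
    return total, with_ic, total - with_ic


# Hypothetical sample rows (placeholder identifiers and values).
sample = io.StringIO(
    "biolink:Disease\tEXAMPLE:1\t48.5\n"
    "biolink:SmallMolecule\tEXAMPLE:2\t\n"
    "biolink:Gene\tEXAMPLE:3\t\n"
)
total, with_ic, without_ic = summarize_ic(sample)
print(f"total={total} with_ic={with_ic} without_ic={without_ic}")
```

To run it on the real output, pass an open file handle for `ic-values-all.tsv` instead of the sample. It reads the file row by row, so it should not need the whole file in memory.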
If you want to look at the distribution of information content values in Babel, here is the breakdown of the number of cliques per value. Note that the second column is NOT sorted, even though it looks somewhat sorted at first glance. ic-values-all-ic-sorted-uniq.txt
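A breakdown like this could also be produced with a short Python sketch that counts cliques per information content value and sorts the values numerically rather than as text. This is not how the attached file was generated. It assumes the same three-column TSV as the jq command above, and the function name `ic_distribution` and the sample rows are hypothetical.

```python
from collections import Counter
import io


def ic_distribution(tsv_lines):
    """Count cliques per information content value, sorted numerically.

    Rows are expected as: type, identifier, ic. Rows with an empty
    third column (no information content value) are skipped.
    """
    counts = Counter()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] != "":
            counts[float(fields[2])] += 1
    # Sorting the (value, count) pairs orders them by numeric value.
    return sorted(counts.items())


# Hypothetical sample rows (placeholder identifiers and values).
sample = io.StringIO(
    "biolink:Disease\tEXAMPLE:1\t100.0\n"
    "biolink:Disease\tEXAMPLE:2\t20.5\n"
    "biolink:Gene\tEXAMPLE:3\t100.0\n"
    "biolink:Gene\tEXAMPLE:4\t\n"
)
for value, count in ic_distribution(sample):
    print(f"{value}\t{count}")
```

Converting the values to `float` before sorting avoids the text ordering in which, for example, "100.0" sorts before "20.5".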
I'd like to add this to a Babel report that gets generated regularly, but that's a low-priority task. If there are other, higher-priority tasks here, please open a new ticket for those.
Asked by @MarkDWilliams at the Translator June Relay.