Determine whether PHENIO edges are necessary in calculating HP vs HP semsim

caufieldjh commented 5 months ago

As per discussion w/ @justaddcoffee and @julesjacobsen, the most recent build prepares semsim values for both of the following:

HP vs HP, through PHENIO
HP vs HP, through HP alone

The thought is that PHENIO edges won't contribute to an all-vs-all comparison within a single ontology. In both of these builds, the IC cutoff was 1.5 and ICs are based on HPOA frequency. Some of the increased size of these builds is because they now contain ID labels.
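For anyone unfamiliar with how an IC cutoff trims the pair list, here is a minimal sketch. It assumes IC is computed as -log2(annotation frequency) and that a pair is kept only when its most informative common ancestor has IC at or above the cutoff. The term IDs, counts, and function names are all made up for illustration, and this is not necessarily how the actual build applies the cutoff.

```python
import math

def information_content(term_count: int, total: int) -> float:
    """IC as -log2 of annotation frequency (an assumed definition)."""
    return -math.log2(term_count / total)

def pairs_above_cutoff(ancestors, ic, cutoff):
    """Keep term pairs whose most informative common ancestor meets the cutoff."""
    kept = {}
    terms = sorted(ancestors)
    for i, a in enumerate(terms):
        for b in terms[i:]:
            common = ancestors[a] & ancestors[b]
            if not common:
                continue
            mica_ic = max(ic[t] for t in common)
            if mica_ic >= cutoff:
                kept[(a, b)] = mica_ic
    return kept

# Hypothetical toy ontology: reflexive ancestor sets and annotation counts.
ancestors = {
    "T:root": {"T:root"},
    "T:a": {"T:a", "T:root"},
    "T:b": {"T:b", "T:root"},
    "T:a1": {"T:a1", "T:a", "T:root"},
    "T:a2": {"T:a2", "T:a", "T:root"},
}
counts = {"T:root": 8, "T:a": 2, "T:b": 4, "T:a1": 1, "T:a2": 1}
ic = {t: information_content(n, 8) for t, n in counts.items()}

kept = pairs_above_cutoff(ancestors, ic, cutoff=1.5)
for pair, score in sorted(kept.items()):
    print(pair, score)
```

In this toy case, pairs that only share the root (IC 0) or T:b (IC 1) are dropped. Changing the ancestor sets, as extra edges would, changes which pairs survive.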

So how do they differ? Below, I will refer to HP vs HP, through PHENIO as "PHENIO" and HP vs HP, through HP alone as "Alone".

Compressed size:

PHENIO: ~586 MB
Alone: ~465 MB

Uncompressed size of semsim table alone:

PHENIO: ~4.2 GB, 22332765 lines
Alone: ~4.3 GB, 23580617 lines

justaddcoffee commented 5 months ago

Uncompressed size of semsim table alone:

PHENIO: ~4.2 GB, 22332765 lines
Alone: ~4.3 GB, 23580617 lines

This might seem weird, but it is plausible: using HP alone may produce more pairs of HP terms that meet the IC cutoff, so the file is bigger.
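One way to check this would be to compare the sets of term pairs in the two tables directly. The sketch below assumes each table is a TSV with a header row and `subject_id` / `object_id` columns. Those column names are an assumption, as are the function names and the placeholder IDs in the demo.

```python
import csv
import io

def load_pairs(handle, subject_col="subject_id", object_col="object_id"):
    """Read (subject, object) ID pairs from a TSV file handle with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row[subject_col], row[object_col]) for row in reader}

def compare_pair_sets(phenio, alone):
    """Count pairs shared by both tables and pairs unique to each."""
    return {
        "shared": len(phenio & alone),
        "phenio_only": len(phenio - alone),
        "alone_only": len(alone - phenio),
    }

# Tiny in-memory stand-ins for the two tables (placeholder IDs, not real data).
phenio_tsv = io.StringIO(
    "subject_id\tobject_id\tscore\n"
    "X:1\tX:2\t2.0\n"
    "X:1\tX:3\t3.1\n"
)
alone_tsv = io.StringIO(
    "subject_id\tobject_id\tscore\n"
    "X:1\tX:2\t2.0\n"
    "X:1\tX:3\t3.1\n"
    "X:2\tX:3\t1.7\n"
)

summary = compare_pair_sets(load_pairs(phenio_tsv), load_pairs(alone_tsv))
print(summary)
```

For the real files, pass handles from `open(path, newline="")` instead of the `StringIO` objects. With over 20 million rows per table, holding both pair sets in memory may be heavy, so a streaming or database-backed comparison could be a better fit.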

caufieldjh commented 5 months ago

URLs for the above:

caufieldjh commented 4 months ago

@julesjacobsen reports that the similarity tables with PHENIO edges result in better performance in Exomizer tests than the HP vs self similarities alone.

Knowledge-Graph-Hub / automate-pheno-comparisons, issue #31