Knowledge-Graph-Hub / automate-pheno-comparisons

Jenkins-based automation of phenotype semantic similarity on PHENIO with Semsimian.
BSD 3-Clause "New" or "Revised" License

Handle cases of CURIEs in closure but not in IC map #12

Open caufieldjh opened 5 months ago

caufieldjh commented 5 months ago

Some CURIEs are in the closure map but not in the IC map. In general this is because the IC maps are built from the association tables, and those will never include all CURIEs in the closure map for a given phenotype ontology. What can we do about this?

Should all terms receive some sort of baseline score by virtue of existing, and then have that score adjusted in proportion to observed frequency? @justaddcoffee replied (https://github.com/Knowledge-Graph-Hub/automate-pheno-comparisons/pull/9#issuecomment-2116148470):

We could do that, but I can't think of how to convince OAK to calculate IC like that.

How about this: if a term is not observed, we set the IC to -log(1/number of total counts). This is essentially setting the count of that term to 1.
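To make that concrete, here is a minimal sketch of the pseudocount idea. The function names and example CURIEs are made up for illustration, and log base 2 is an assumption, since the comment above does not say which base is used.

```python
import math


def information_content(count: int, total: int) -> float:
    """IC = -log2(count / total). Base 2 is an assumption."""
    return -math.log2(count / total)


def ic_with_pseudocount(counts: dict[str, int], total: int, term: str) -> float:
    """Return the IC for a term, treating an unobserved term as seen once.

    `counts` maps CURIE -> observed count and `total` is the number of
    total counts. Both are hypothetical inputs for this sketch.
    """
    count = counts.get(term, 0)
    # An unobserved term gets a count of 1, i.e. IC = -log2(1 / total).
    return information_content(max(count, 1), total)


# Hypothetical data: one observed term, 8 total counts.
counts = {"HP:0000001": 4}
observed_ic = ic_with_pseudocount(counts, 8, "HP:0000001")
unobserved_ic = ic_with_pseudocount(counts, 8, "HP:0000002")
```

With these made-up numbers, the observed term has an IC of 1.0 and the unobserved one has 3.0, which is the largest value any term can get for that total.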

To do this, we could post-process the IC TSV file and assign every PHENIO term that is missing from it the value max(IC score in the IC TSV file). Or, we could do this in semsimian after we read in the IC TSV file.
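The post-processing route could look roughly like the sketch below. It assumes the IC file is a two-column TSV (CURIE, IC score) with no header row, which may not match the real file layout, and `fill_missing_ic` is a hypothetical helper, not an existing OAK or semsimian function.

```python
import csv
import io


def fill_missing_ic(ic_tsv_text: str, closure_terms: set[str]) -> dict[str, float]:
    """Give every closure term that lacks an IC score the max IC in the file.

    Assumes two tab-separated columns (CURIE, IC) and no header row.
    """
    reader = csv.reader(io.StringIO(ic_tsv_text), delimiter="\t")
    ic = {row[0]: float(row[1]) for row in reader if row}
    max_ic = max(ic.values())
    for term in closure_terms:
        # Terms already in the file keep their own score.
        ic.setdefault(term, max_ic)
    return ic


# Hypothetical file contents and closure terms.
example_tsv = "HP:0000001\t1.5\nHP:0000002\t3.0\n"
filled = fill_missing_ic(example_tsv, {"HP:0000001", "HP:0000003"})
```

If the scores in the file were computed as -log(count/total), the file maximum matches -log(1/total) only when at least one observed term has a count of 1, so the two proposals above can give slightly different values.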

One option is to add an argument to semsimian and OAK that specifies what to do with CURIEs that are in the closure but not in the IC map. Failing should be one of the choices, but not the default. The other choices should be to infer (assign the max IC to newly observed CURIEs) or to use proportional IC.
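As a rough illustration of what such an argument might look like, here is a sketch in plain Python. None of these names exist in semsimian or OAK; `MissingICPolicy` and `resolve_ic` are hypothetical. Making "infer" the default is an assumption (the thread only says failing should not be the default), and the proportional option is left out because it has not been defined yet.

```python
from enum import Enum


class MissingICPolicy(Enum):
    """Hypothetical choices for CURIEs in the closure but not in the IC map."""
    FAIL = "fail"
    INFER = "infer"  # assign the max IC seen in the IC map


def resolve_ic(
    term: str,
    ic_map: dict[str, float],
    policy: MissingICPolicy = MissingICPolicy.INFER,
) -> float:
    """Look up a term's IC, applying the chosen policy when it is missing."""
    if term in ic_map:
        return ic_map[term]
    if policy is MissingICPolicy.FAIL:
        raise KeyError(f"{term} is in the closure but not in the IC map")
    # INFER: fall back to the largest IC in the map.
    return max(ic_map.values())


# Hypothetical IC map with two observed terms.
ic_map = {"HP:0000001": 1.5, "HP:0000002": 3.0}
inferred = resolve_ic("HP:0000003", ic_map)
```

Calling `resolve_ic("HP:0000003", ic_map, MissingICPolicy.FAIL)` would raise a `KeyError` instead of returning a value.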