anergictcell / hpo

Rust library for the Human Phenotype Ontology

Recent HPO versions include all parents in gene to phenotype associations #44

Closed holtgrewe closed 7 months ago

holtgrewe commented 1 year ago

For example, ARID1B has 532 (sic!) unique HPO term associations in release 2023-06-06.

It looks like the full parent sub-DAG is stored for each gene, since the associations even include the root term All.

I would suggest pruning the imported phenotype_to_genes.txt list as follows:

for each gene:
    all_terms := all terms associated to the gene
    terms_to_prune := []
    for each term in all_terms:
        parents := all (direct and indirect) parents of term, excluding term
        terms_to_prune := terms_to_prune + parents
    actual_terms := all_terms - terms_to_prune
    // associate gene with the actual_terms
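
The pseudocode above could be sketched in plain Rust roughly as follows. This is only an illustration using the standard library, not the hpo crate's actual import code: the `parents` map (term to its direct parents), the function names `ancestors` and `prune_to_specific`, and the toy term names are all assumptions made for this example.

```rust
use std::collections::{HashMap, HashSet};

/// Collects all direct and indirect parents of `term`, excluding `term` itself.
/// `parents` is a hypothetical lookup from a term to its direct parents.
fn ancestors<'a>(term: &'a str, parents: &HashMap<&'a str, Vec<&'a str>>) -> HashSet<&'a str> {
    let mut seen = HashSet::new();
    let mut stack = vec![term];
    while let Some(current) = stack.pop() {
        if let Some(direct) = parents.get(current) {
            for &p in direct {
                // Only walk further up from parents we have not visited yet.
                if seen.insert(p) {
                    stack.push(p);
                }
            }
        }
    }
    seen
}

/// Keeps only those terms that are not an ancestor of another annotated term.
fn prune_to_specific<'a>(
    all_terms: &HashSet<&'a str>,
    parents: &HashMap<&'a str, Vec<&'a str>>,
) -> HashSet<&'a str> {
    let mut terms_to_prune = HashSet::new();
    for &term in all_terms {
        terms_to_prune.extend(ancestors(term, parents));
    }
    all_terms.difference(&terms_to_prune).copied().collect()
}

fn main() {
    // Toy ontology (made-up term names): All <- Abnormality <- A <- {B, C}
    let mut parents: HashMap<&str, Vec<&str>> = HashMap::new();
    parents.insert("Abnormality", vec!["All"]);
    parents.insert("A", vec!["Abnormality"]);
    parents.insert("B", vec!["A"]);
    parents.insert("C", vec!["A"]);

    // The gene is annotated with the full parent sub-DAG.
    let all_terms: HashSet<&str> = ["All", "Abnormality", "A", "B", "C"].into_iter().collect();

    let mut actual_terms: Vec<&str> = prune_to_specific(&all_terms, &parents).into_iter().collect();
    actual_terms.sort();
    println!("{:?}", actual_terms);
}
```

Note that, like the pseudocode, this drops a term whenever one of its descendants is also annotated, even if the term was originally annotated directly.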

Otherwise, similarity computations become problematic for highly annotated genes such as ARID1B.

holtgrewe commented 1 year ago

After contacting the HPO author, it looks like genes_to_phenotype.txt would actually be the more appropriate file for the hpo crate to import.

https://hpo.jax.org/app/resources/faq


anergictcell commented 1 year ago

Depending on your use case, you can remove all non-leaf terms and compare only the leaves of HpoSets. https://docs.rs/hpo/latest/hpo/struct.HpoSet.html#method.child_nodes

let gene = ontology.gene_by_name("ARID1B").unwrap();
let set = gene.to_hpo_set(&ontology).child_nodes();
set.similarity(....)

Not sure if that helps, but it is something that I recommend for most comparisons.

While we're at it, for comparisons, I usually also remove modifier_terms (or remove them in place) so that the comparison only uses children of Phenotypic abnormality.

anergictcell commented 7 months ago

It took me a while to finally grasp this issue. I never considered this to be a problem, but I have now realized that terms should not be transitively added to genes. That way, genes will behave the same way as diseases. This will be fixed with this pull request and updated on crates.io with the next release.