Closed by Colossus 8 years ago
This query returns all Charite gene-pheno pairs where we never pick up the pheno:
select distinct
hpo_id,
canonical_name
from
charite c
join genes g
on (g.ensembl_id = c.ensembl_id)
left join pheno_mentions p
on (c.hpo_id = p.entity)
where p.doc_id is null;
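For anyone who wants to poke at the anti-join (left join plus "is null") pattern without the real database, here is a minimal sketch against an in-memory SQLite database. The table layouts, the assumption that canonical_name lives in genes, and all rows are made up for illustration; only the table and column names come from the query above.

```python
import sqlite3

# Toy schema: column placement is an assumption, rows are invented placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table charite (hpo_id text, ensembl_id text);
    create table genes (ensembl_id text, canonical_name text);
    create table pheno_mentions (doc_id text, entity text);
    insert into charite values ('HP:A', 'ENSG_1'), ('HP:B', 'ENSG_1');
    insert into genes values ('ENSG_1', 'GENE1');
    insert into pheno_mentions values ('doc1', 'HP:A');
""")

# Same shape as the query above: keep Charite rows with no matching mention.
rows = conn.execute("""
    select distinct c.hpo_id, g.canonical_name
    from charite c
    join genes g on (g.ensembl_id = c.ensembl_id)
    left join pheno_mentions p on (c.hpo_id = p.entity)
    where p.doc_id is null
""").fetchall()

print(rows)
```

In the toy data only HP:B has no mention, so it is the only pair that survives the "p.doc_id is null" filter.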
How many Charite phenos are not "allowed" phenotypic abnormalities (i.e. under phenotypic abnormality and not a cancer)?
select count(distinct hpo_id) from charite;
6074
select count(distinct hpo_id) from charite where hpo_id not in (select distinct hpo_id from allowed_phenos);
346
So very few are "disallowed phenos": 346 of 6074, about 5.7%.
We should automatically add synonyms that turn "Abnormality of X" into "abnormal X", e.g. "Abnormality of skin physiology" into "abnormal skin physiology", and do the same for every "abnormalities of" term.
"Unossified sacrum" is hard to find: only 11 sentences in the whole database contain the word "unossified".
Why don't we find anything for "Thymoma" (HP:0100522)? This should be an easy one.
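A rough sketch of that rewrite, assuming matching downstream is case-insensitive so the synonym can be lowercased. The helper name abnormality_synonym and the regex are hypothetical, not existing pipeline code.

```python
import re

# Leading "Abnormality of" / "Abnormalities of", with an optional "the".
_ABNORMALITY_OF = re.compile(r"^abnormalit(?:y|ies) of (?:the )?", re.IGNORECASE)


def abnormality_synonym(name):
    """Return 'abnormal <rest>' for 'Abnormality of (the) <rest>', else None."""
    match = _ABNORMALITY_OF.match(name)
    if match is None:
        return None
    rest = name[match.end():].strip()
    if not rest:
        return None
    return "abnormal " + rest.lower()


print(abnormality_synonym("Abnormality of skin physiology"))
```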
EDIT: it's a cancer
Maybe just dump all "abnormality of" and "abnormal" prefixes such as in "abnormal eye physiology"
"neoplasm", "cancer" and "tumor" should all be treated as "tumor"; insert these synonyms manually.
Why don't we pick up "stillbirth" (HP:0003826)?
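If the manual list gets tedious, one possible way to generate the swaps is sketched below. It treats the three words as interchangeable and emits the other two variants for any name containing one of them; that interpretation, the helper name tumor_synonyms and the lowercasing are all assumptions.

```python
import re

# The three words from the note above, treated as interchangeable.
TUMOR_WORDS = ("neoplasm", "cancer", "tumor")


def tumor_synonyms(name):
    """Return lowercased variants of name with each tumor word swapped for the others."""
    lowered = name.lower()
    variants = set()
    for word in TUMOR_WORDS:
        pattern = re.compile(r"\b" + word + r"\b")
        if pattern.search(lowered):
            for other in TUMOR_WORDS:
                if other != word:
                    variants.add(pattern.sub(other, lowered))
    return sorted(variants)


print(tumor_synonyms("Neoplasm of the skin"))
```

Plural forms ("tumors", "neoplasms") are not handled here and would still need manual entries.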
EDIT: It's not a phenotypic abnormality ... we're getting a little morbid here
Chop off "morphology" and "physiology" suffixes, as in "Abnormal trabecular bone morphology" or "abnormal eye physiology", unless all that remains is a simple English word such as "eye".
Should we leave "sarcoma" phenotypes in and allow them, e.g. HP:0200058 angiosarcoma?
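A minimal sketch of the suffix rule. "Simple English word" is approximated here as "a single word left once the abnormal/abnormality-of prefix is ignored", which is an assumption; chop_suffix is a hypothetical helper.

```python
import re

_SUFFIX = re.compile(r"\s+(?:morphology|physiology)$", re.IGNORECASE)
_PREFIX = re.compile(r"^(?:abnormalit(?:y|ies) of (?:the )?|abnormal )", re.IGNORECASE)


def chop_suffix(name):
    """Drop a trailing 'morphology'/'physiology'; return None if nothing useful remains."""
    original = name.strip()
    stripped = _SUFFIX.sub("", original)
    if stripped == original:
        return None  # no suffix to chop
    # Ignore the "abnormal" prefix when judging what is left over.
    core = _PREFIX.sub("", stripped)
    if len(core.split()) < 2:
        return None  # only a single word such as "eye" would remain
    return stripped


print(chop_suffix("Abnormal trabecular bone morphology"))
print(chop_suffix("abnormal eye physiology"))
```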
EDIT: forget about them, it's cancer
Split phenos containing a slash and create a synonym for each alternative; this won't work perfectly, but it's better than nothing.
Split only the slashed word itself, keeping the rest of the name unchanged.
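Something like the following would do the per-word split. It is a sketch only, and split_slashes is a hypothetical name: each whitespace-separated word is split on "/" and every combination is emitted with the other words untouched.

```python
from itertools import product


def split_slashes(name):
    """Expand each slashed word into its alternatives; return [] if there is no slash."""
    options = []
    for word in name.split():
        parts = [part for part in word.split("/") if part]
        options.append(parts or [word])
    if all(len(alternatives) == 1 for alternatives in options):
        return []
    return [" ".join(choice) for choice in product(*options)]


print(split_slashes("Aplasia/Hypoplasia of the thumb"))
```

This is where the "won't work perfectly" caveat bites: multi-word alternatives on either side of the slash and constructs like "and/or" come out wrong with a word-level split.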
If dropping "abnormality of (the)"/"abnormal" leaves only one word, don't add that single word as a synonym; instead append "physiology", "morphology", "dysplasia", "hypoplasia" and "aplasia" to it and add those.
Replace every "physiology" with "morphology".
When dropping "abnormality" in general, also add the "physiology", "morphology", "dysplasia", "hypoplasia" and "aplasia" variants.
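The prefix-dropping rules above could be sketched roughly as follows. The helper name drop_abnormality is hypothetical, output is lowercased on the assumption that matching is case-insensitive, and skipping the extra suffixes when the name already ends in one of them is my own guess at the intent.

```python
import re

_PREFIX = re.compile(r"^(?:abnormalit(?:y|ies) of (?:the )?|abnormal )", re.IGNORECASE)
SUFFIXES = ("physiology", "morphology", "dysplasia", "hypoplasia", "aplasia")


def drop_abnormality(name):
    """Return synonyms obtained by dropping the abnormality prefix and adding suffixes."""
    lowered = name.strip().lower()
    core = _PREFIX.sub("", lowered)
    if core == lowered or not core:
        return []  # no prefix to drop
    words = core.split()
    # A bare single word (e.g. "eye") is too generic to add on its own.
    synonyms = [] if len(words) == 1 else [core]
    # Assumption: don't stack a suffix on a name that already ends in one.
    if words[-1] not in SUFFIXES:
        synonyms.extend(core + " " + suffix for suffix in SUFFIXES)
    return synonyms


print(drop_abnormality("Abnormality of the eye"))
```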
Not too sure if this is the best place, but I'm going to put my Charite pheno recall analysis here. I.e. why are we extracting far fewer phenos than Charite?