GeneDx / phenopy

Phenotype comparison tools using semantic similarity.

Calculate gene similarity on the HPO #60

Open stefanucci-luca opened 3 years ago

stefanucci-luca commented 3 years ago

Dear Kevin,

I would like to calculate the similarity for a few genes (~2000). I annotated these genes with the HPO codes from the human phenotype ontology webpage (http://compbio.charite.de/jenkins/job/hpo.annotations/lastSuccessfulBuild/artifact/util/annotation/genes_to_phenotype.txt).

I reshaped the annotations and got a file like this:

A4GALT  .   HP:0010970|HP:0000006
AAAS    .   HP:0040281|HP:0040282|HP:0040283|HP:0011463|HP:0001278|HP:0000972|HP:0012332|HP:0008259|HP:0004322|HP:0001251|HP:0000648|HP:0000007|HP:0002571|HP:0004319|HP:0001263|HP:0008163|HP:0001249|HP:0009916|HP:0003487|HP:0007002|HP:0000252|HP:0001347|HP:0000522|HP:0003676|HP:0000649|HP:0001324|HP:0000953|HP:0001260|HP:0000846|HP:0001250|HP:0007440|HP:0000505|HP:0000982|HP:0001761|HP:0010486|HP:0000830|HP:0007556|HP:0002093|HP:0001430|HP:0001252|HP:0002376|HP:0000612|HP:0000407
AASS    .   HP:0000119|HP:0000752|HP:0001083|HP:0001903|HP:0003593|HP:0001250|HP:0002161|HP:0000736|HP:0001252|HP:0100543|HP:0000007|HP:0001256|HP:0000750|HP:0001249
ABAT    .   HP:0025356|HP:0000278|HP:0000098|HP:0007291|HP:0000007|HP:0002415|HP:0001321|HP:0000494|HP:0001347|HP:0006829|HP:0001263|HP:0001274|HP:0001250|HP:0001254|HP:0025430|HP:0003819
ABCA4   .   HP:0040280|HP:0040281|HP:0040282|HP:0040283|HP:0040284|HP:0000006|HP:0007663|HP:0000662|HP:0001133|HP:0000608|HP:0000512|HP:0000543|HP:0000007|HP:0007737|HP:0007722|HP:0000510|HP:0007984|HP:0007843|HP:0000548|HP:0000580|HP:0000572|HP:0008035|HP:0000639|HP:0000618|HP:0000405|HP:0000603|HP:0000135|HP:0000493|HP:0000463|HP:0001249|HP:0007703|HP:0000613|HP:0000987|HP:0030329|HP:0000649|HP:0000648|HP:0000551|HP:0008046|HP:0000407|HP:0007704|HP:0007814|HP:0008736|HP:0000035|HP:0008002|HP:0007675|HP:0000431|HP:0000610|HP:0000518|HP:0000602|HP:0001513|HP:0008059|HP:0000501|HP:0000563|HP:0000842|HP:0030500|HP:0001347|HP:0000505|HP:0005978|HP:0011504|HP:0011462|HP:0011463|HP:0003621|HP:0007994
ABCB11  .   HP:0040283|HP:0000989|HP:0002014|HP:0003155|HP:0000952|HP:0001081|HP:0003593|HP:0001394|HP:0001744|HP:0001046|HP:0002240|HP:0002630|HP:0002908|HP:0000007|HP:0003819|HP:0004322|HP:0001508|HP:0001406|HP:0001402

which I think is the correct format for phenopy. I then used the command:

phenopy score gene_lists_with_HPO.txt --threads 12 --self

and I got as output something like this:

#query  entity_id   score
A4GALT  A4GALT  1.0
A4GALT  ABCD1   0.0
A4GALT  ACAT1   0.010405043493187662
A4GALT  ACVRL1  0.03336405048957507
A4GALT  ADGRG1  0.0
A4GALT  AGXT    0.009234121604447244
A4GALT  AKT1    0.003509945769583653
A4GALT  ALG1    0.0
A4GALT  AMER1   0.0

However, the self-similarity score for some genes is not 1, as I was expecting. For instance:

ABCB7 ABCB7 0.5558528984777618

Would you expect something like this? How would you explain it? Should I use a different --summarization-method?

Best regards,

Luca
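The reshaping step described in this comment can be sketched roughly as follows. This is a minimal sketch, not the script actually used: it assumes the annotations have already been read as (gene symbol, HPO ID) pairs, that the output is tab-separated with a "." placeholder in the second column as in the listing above, and the function name `build_phenopy_records` is hypothetical.

```python
from collections import defaultdict


def build_phenopy_records(pairs):
    """Group (gene_symbol, hpo_id) pairs into 'GENE<TAB>.<TAB>HP:..|HP:..' lines.

    Hypothetical helper: the tab delimiter and the '.' placeholder column
    are assumptions based on the listing shown in the thread.
    """
    terms_by_gene = defaultdict(list)
    for gene, hpo_id in pairs:
        # Keep the first occurrence of each term and drop duplicates.
        if hpo_id not in terms_by_gene[gene]:
            terms_by_gene[gene].append(hpo_id)
    return [
        f"{gene}\t.\t{'|'.join(terms)}"
        for gene, terms in sorted(terms_by_gene.items())
    ]


# Toy input using IDs that appear in the listing above.
pairs = [
    ("A4GALT", "HP:0010970"),
    ("A4GALT", "HP:0000006"),
    ("AASS", "HP:0000119"),
    ("A4GALT", "HP:0010970"),  # duplicate annotation, ignored
]
records = build_phenopy_records(pairs)
print("\n".join(records))
```

Each element of `records` is one line of the input file; writing them out joined by newlines gives a file in the layout shown above.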

arvkevi commented 3 years ago

Hi Luca,

Thank you for checking out the repo. It looks like you have successfully run phenopy on your input files, that's great! The behavior you describe is expected: it is a property of the HRSS semantic similarity scoring algorithm, which scales similarity scores to reward comparisons between nodes further down the ontology. As implemented here, even a phenotype compared to itself scores 1.0 under HRSS only when its beta_ic is 0.0, which is the case for leaf nodes. Does this explanation help?
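To make the role of beta_ic concrete, here is a toy sketch of an HRSS-style term score. It assumes the commonly cited form of HRSS, 1 / (1 + gamma) * alpha_ic / (alpha_ic + beta_ic), where alpha_ic is the information content of the most informative common ancestor, beta_ic is the average information-content distance from the terms to their most informative leaf descendants, and gamma is the distance between the two terms. The function and the numbers are illustrative only; phenopy's actual implementation may differ in detail.

```python
def hrss(alpha_ic, beta_ic, gamma):
    """Toy HRSS-style score (assumed form, not taken from phenopy's code)."""
    if alpha_ic == 0:
        # No shared information content: treat the pair as unrelated.
        return 0.0
    return (1.0 / (1.0 + gamma)) * (alpha_ic / (alpha_ic + beta_ic))


# A term compared with itself has gamma == 0.
leaf_self = hrss(alpha_ic=3.0, beta_ic=0.0, gamma=0.0)      # leaf node
internal_self = hrss(alpha_ic=3.0, beta_ic=1.0, gamma=0.0)  # non-leaf node
print(leaf_self, internal_self)
```

Under this assumed form, a self-comparison reaches 1.0 only when beta_ic is 0.0; any term with more specific descendants scores lower against itself, which is consistent with a gene-level self score below 1.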

viktorzou commented 4 months ago

So how would I set a network-cutoff value, then, if identical terms might not score 1.0? Also, is there any way to supply my own scores if I have frequency values attached to phenotypes?
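Before choosing a cutoff, it may help to look at how far the self scores in the output deviate from 1.0. The sketch below is only one way to inspect the three-column score table shown earlier in the thread: the helper names are hypothetical, the whitespace-delimited layout is assumed from the posted output, and the 0.03 cutoff is an arbitrary illustration rather than a recommended value.

```python
def parse_scores(text):
    """Parse '#query entity_id score' rows into (query, entity, score) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        query, entity, score = line.split()
        rows.append((query, entity, float(score)))
    return rows


def self_scores(rows):
    """Map each entity to the score it gets against itself."""
    return {query: score for query, entity, score in rows if query == entity}


def pairs_above(rows, cutoff):
    """Keep non-self pairs whose score is at least `cutoff`."""
    return [row for row in rows if row[0] != row[1] and row[2] >= cutoff]


# Rows copied from the output posted in the thread.
sample = """#query  entity_id   score
A4GALT  A4GALT  1.0
A4GALT  ABCD1   0.0
A4GALT  ACVRL1  0.03336405048957507
ABCB7   ABCB7   0.5558528984777618
"""
rows = parse_scores(sample)
print(self_scores(rows))
print(pairs_above(rows, cutoff=0.03))  # 0.03 is an arbitrary example value
```

Comparing a candidate cutoff against the range of self scores gives a rough sense of how strict it is for a given gene set; this does not address supplying custom frequency-based scores.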