Knowledge-Graph-Hub / automate-pheno-comparisons

Jenkins-based automation of phenotype semantic similarity on PHENIO with Semsimian.
BSD 3-Clause "New" or "Revised" License

Updates to produce new, better phenotype comparisons #9

Closed caufieldjh closed 5 months ago

caufieldjh commented 6 months ago

This currently generates the IC maps but doesn't yet pass them to the runoak similarity call.

caufieldjh commented 6 months ago

On the Jenkins run, the line for building the custom IC map fails due to an SQL error, as below:

09:33:34  + runoak -g hpoa.tsv -G hpoa -i sqlite:obo:phenio information-content -p i --use-associations .all
12:32:40  sqlite3.OperationalError: too many SQL variables
12:32:40  
12:32:40  The above exception was the direct cause of the following exception:
12:32:40  
12:32:40  sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) too many SQL variables
12:32:40  [SQL: SELECT term_association.id AS term_association_id, term_association.subject AS term_association_subject, term_association.predicate AS term_association_predicate, term_association.object AS term_association_object, term_association.evidence_type AS term_association_evidence_type, term_association.publication AS term_association_publication, term_association.source AS term_association_source 
12:32:40  FROM term_association 
12:32:40  WHERE term_association.object IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
...
)]
12:32:40  [parameters: ('HP:0011383', 'UBERON:0001659', 'FBbt:00049536', 'ZP:0021559', 'EMAPA:37004', 'MONDO:1011007', 'XPO:0125859', 'EMAPA:37475', 'XPO:0140513', 'ZP:0017420', 'ZP:0101168', 'WBbt:0005868', 'ZP:0107311', 'UBERON:0004797', 'UPHENO:0083165', 'FBbt:20003566', 'MONDO:0020694', 'FBbt:20005476', 'ZP:0108043', 'GO:0038037', 'ZP:0106629', 'MONDO:0021104', 'FBbt:00007621', 'MONDO:0014365', 'GO:0099504', 'XPO:0132914', 'XPO:0131894', 'UPHENO:0033626', 'UBERON:2000989', 'UPHENO:0068091', 'GO:0060309', 'ZP:0144110', 'http://purl.org/sig/ont/fma/fma87217', 'XPO:0131698', 'XPO:0140503', 'http://purl.org/sig/ont/fma/fma70973', 'EMAPA:35202', 'ZP:0108015', 'UPHENO:0049528', 'MONDO:1010565', 'GO:0061649', 'MP:0030985', 'EMAPA:17217', 'EMAPA:37775', 'ZP:0142756', 'HP:0009956', 'ZP:0022036', 'FBbt:00110255', 'ZP:0006550', 'MONDO:0007939' ... 264610 parameters truncated ... 'MP:0030179', 'HP:0100737', 'ZP:0137599', 'HP:0006379', 'XPO:0101904', 'CHEBI:24405', 'MP:0001077', 'FBbt:20003713', 'UPHENO:0077064', 'FYPO:0003726', 'GO:1903624', 'UPHENO:0085196', 'http://purl.org/sig/ont/fma/fma77433', 'UPHENO:0008937', 'ZP:0014100', 'HP:0032858', 'GO:0032085', 'MONDO:0006790', 'MONDO:0012249', 'XPO:0125538', 'FBbt:20004698', 'FBbt:00111359', 'GO:0034869', 'ZP:0009516', 'XPO:0101461', 'MA:0000955', 'FBbt:00001730', 'NCBITaxon:9256', 'ZP:0144612', 'XPO:0129035', 'XPO:0128313', 'XPO:0128764', 'FYPO:0007898', 'MP:0011753', 'GO:0010948', 'HP:0032099', 'MONDO:0018480', 'UPHENO:0050161', 'ZP:0107769', 'FBbt:00047743', 'UBERON:0001453', 'FBbt:00000230', 'XPO:0130441', 'MONDO:0003952', 'ZP:0108382', 'MP:0001866', 'http://purl.org/sig/ont/fma/fma86489', 'WBPhenotype:0002504', 'UBERON:0016536', 'UPHENO:0083851')]
12:32:40  (Background on this error at: https://sqlalche.me/e/20/e3q8)

caufieldjh commented 6 months ago

This may require sqlite to be updated on the build environment. See also https://github.com/deepset-ai/haystack/issues/588
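For context, SQLite caps the number of `?` placeholders per statement (SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 before version 3.32.0 and 32766 from 3.32.0 onward). The query above carries more than 264,000 parameters, so a newer SQLite build with default settings would likely still reject it. One common workaround is to split the `IN (...)` list into batches. The sketch below illustrates that idea on a toy in-memory table; the helper names `chunked` and `select_by_objects`, the three-column table, and the batch size are assumptions for illustration, not OAK's actual code or fix.

```python
import sqlite3
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def select_by_objects(conn, objects, chunk_size=500):
    """Fetch term_association rows for many objects, one batch at a time.

    Each statement binds at most `chunk_size` parameters, which stays
    under SQLite's per-statement placeholder limit.
    """
    rows = []
    for batch in chunked(objects, chunk_size):
        placeholders = ", ".join("?" * len(batch))
        sql = (
            "SELECT subject, predicate, object FROM term_association "
            f"WHERE object IN ({placeholders})"
        )
        rows.extend(conn.execute(sql, batch).fetchall())
    return rows


# Toy in-memory table standing in for the PHENIO database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term_association (subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO term_association VALUES (?, ?, ?)",
    [(f"GENE:{i}", "has_phenotype", f"HP:{i:07d}") for i in range(2000)],
)

wanted = [f"HP:{i:07d}" for i in range(0, 2000, 2)]  # 1000 terms to look up
rows = select_by_objects(conn, wanted)
print(len(rows))
```

Batching trades one large query for several small ones, so it works regardless of which SQLite version the build environment provides.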

caufieldjh commented 6 months ago

With the changes in https://github.com/INCATools/ontology-access-kit/pull/764 I expect this to work better. We will just need to pin the OAK version to the GitHub repository.

caufieldjh commented 6 months ago

Will test after merging https://github.com/INCATools/ontology-access-kit/pull/759

caufieldjh commented 6 months ago

Getting Jenkins errors on test - unclear why:

Error when executing always post condition:
Also:   org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 7c6506e2-fc79-47e8-b60a-bfd6d52143dd
org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
Perhaps you forgot to surround the code with a step that provides this, such as: node
    at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:89)
    at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:70)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

caufieldjh commented 6 months ago

The Jenkins error was due to string interpolation in environment variable names not working as expected.

caufieldjh commented 6 months ago

Getting a ModuleNotFoundError: No module named 'semsimian' because semsimian is an optional dependency. Fixing.

caufieldjh commented 6 months ago

Good news: IC map is produced quickly and without errors! Bad news: Something (OAK or Semsimian) is raising an error:

10:13:21  + runoak -i semsimian:sqlite:obo:phenio similarity --no-autolabel --information-content-file hpoa_ic.tsv -p i --set1-file HPO_terms.txt --set2-file HPO_terms.txt -O csv -o HP_vs_HP_semsimian_-20240516.tsv --min-ancestor-information-content 4.0
10:13:39  Loading custom IC map from: "hpoa_ic.tsv"
10:13:39  Failed to import custom IC map: Error parsing IC value: invalid float literal

caufieldjh commented 6 months ago

That error may be a false positive - when I run a single pair locally and pass the IC map, I don't get that error:

$ runoak -vvv -i semsimian:sqlite:obo:phenio similarity --no-autolabel --information-content-file hpoa_ic.tsv -p i -O csv -o test.tsv --min-ancestor-information-content 4.0 HP:0007166 @ HP:0007153
INFO:root:Setting other_languages=()
INFO:root:Settings = Settings(impl=None, autosave=False, associations_type=None, preferred_language=None, other_languages=())
INFO:root:Wrapping an existing OAK implementation to fetch sqlite:obo:phenio
INFO:root:Locator: obo:phenio
INFO:root:Ensuring gunzipped for https://s3.amazonaws.com/bbop-sqlite/phenio.db.gz
INFO:root:Locator, post-processed: sqlite:////home/harry/.data/oaklib/phenio.db
DEBUG:root:Paths to search: [PurePosixPath('model/schema'), PurePosixPath('schema'), PurePosixPath('linkml'), PurePosixPath('src/linkml'), PurePosixPath('src/model'), PurePosixPath('src/model/schema'), PurePosixPath('src/schema'), PurePosixPath('.')]
DEBUG:root:candidate model/schema not found
DEBUG:root:candidate schema not found
DEBUG:root:candidate linkml not found
DEBUG:root:candidate src/linkml not found
DEBUG:root:candidate src/model not found
DEBUG:root:candidate src/model/schema not found
DEBUG:root:candidate src/schema not found
INFO:root:out=test.tsv <class 'str'>
INFO:root:file=<_io.TextIOWrapper name='test.tsv' mode='w' encoding='UTF-8'> <class 'str'>
INFO:root:Splitting terms ['HP:0007166', '@', 'HP:0007153'] on 1
INFO:root:Calculating all-by-all pairwise similarity for 1 objects
[00:00:09] Building closure and IC map: ████████████████████████████████████████ 100%
INFO:root:Post-processing results from semsimian

This is the relevant function in semsimian: https://github.com/monarch-initiative/semsimian/blob/e2cba82624dc0633092e0b299d80f3771d01659c/src/utils.rs#L579-L607

Maybe it's just trying to parse EOF? Seems unlikely as there isn't a tab in the last line.

Oh. Wait. No. It's the header. Semsimian doesn't expect it. Not sure why that error isn't being raised locally but I'll put a fix for it in here.
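A minimal sketch of one way to handle the header: drop the first row of the IC TSV when its second column is not a number. The helper name `strip_ic_header` and the header text `id` / `information_content` used in the demo are assumptions for illustration; this is not necessarily the fix that went into the pipeline.

```python
def is_float(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def strip_ic_header(lines):
    """Return the TSV lines without a leading header row, if one is present.

    A first row counts as a header when it has at least two tab-separated
    fields and the second one is not a number.
    """
    lines = list(lines)
    if lines:
        fields = lines[0].rstrip("\n").split("\t")
        if len(fields) >= 2 and not is_float(fields[1]):
            return lines[1:]
    return lines


# Demo with made-up IC values.
raw = "id\tinformation_content\nHP:0000118\t1.5\nHP:0000618\t7.25\n"
cleaned = strip_ic_header(raw.splitlines(keepends=True))
print("".join(cleaned), end="")
```

Because the check looks at the content of the first row rather than always skipping it, running it on a file that has no header leaves the file unchanged.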

caufieldjh commented 6 months ago

Looks like the map is now loading correctly, at least for HP.

12:14:28  + runoak -i semsimian:sqlite:obo:phenio similarity --no-autolabel --information-content-file hpoa_ic.tsv -p i --set1-file HPO_terms.txt --set2-file HPO_terms.txt -O csv -o HP_vs_HP_semsimian_-20240516.tsv --min-ancestor-information-content 4.0
12:14:46  Loading custom IC map from: "hpoa_ic.tsv"
12:14:46  Custom IC map imported successfully.
12:14:55  Warning: The following keys are present in closure_map but not in ic_map:

This is followed by a list of more than 20K CURIEs, including 5952 HP IDs. Might be excessive to write that to stdout.

justaddcoffee commented 6 months ago

I wonder why so many CURIEs are in closure_map but not ic_map?

caufieldjh commented 6 months ago

If an HP ID doesn't appear in the association table at all, it doesn't get an IC score. Based on a cursory, non-exhaustive check, many of these IDs are either rare phenotypes, like HP:0100378 (Absent distal phalanx of the 3rd toe), or high-level terms that aren't specific enough to be used in associations, like HP:0008008 (Reduced visual acuity). The latter's child term HP:0000618 (Blindness) is used 168 times in the HPOA, for example.

Is this the expected behavior, though? Should all terms receive some sort of baseline score by virtue of existing, then have that score get adjusted proportionally by observed frequency?

Edit: the CURIE for Reduced visual acuity is actually HP:0007663, and that does appear in the IC table. HP:0008008 is the obsolete term Progressive central visual loss. The obsolete term appears in the closure but not in the IC table.

justaddcoffee commented 6 months ago

> Is this the expected behavior, though?

This is expected, but it is going to cause problems. I think we will need IC values for everything in phenio.

> Should all terms receive some sort of baseline score by virtue of existing, then have that score get adjusted proportionally by observed frequency?

We could do that, but I can't think of how to convince OAK to calculate IC like that.

How about this: if a term is not observed, we set the IC to -log(1/number of total counts). This is essentially setting the count of that term to 1.

To do this, we could post-process the IC tsv file to set any term in phenio that isn't in there to max(IC score in the IC TSV file). Or, we could do this in semsimian after we read in the IC tsv file.
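A rough sketch of the post-processing option, assuming the IC table has already been read into a dict and that a list of all phenio term IDs is available. The function name `fill_missing_ic`, the base-2 logarithm, and the demo values are assumptions for illustration. Note that the two fallbacks coincide only when the rarest observed term has a count of exactly 1.

```python
import math


def fill_missing_ic(ic_map, all_terms, total_count=None):
    """Return a copy of ic_map in which every term in all_terms has a value.

    If total_count is given, unobserved terms get -log2(1 / total_count),
    i.e. they are treated as if they had been observed once. Otherwise
    they get the largest IC already present in ic_map.
    """
    if total_count is not None:
        fallback = -math.log2(1 / total_count)  # log base 2 is an assumption
    else:
        fallback = max(ic_map.values())
    filled = dict(ic_map)
    for term in all_terms:
        filled.setdefault(term, fallback)  # observed terms keep their IC
    return filled


# Demo with made-up IC values; HP:0100378 is missing from the IC table.
ic_map = {"HP:0000618": 7.0, "HP:0007663": 3.0}
all_terms = ["HP:0000618", "HP:0007663", "HP:0100378"]

by_max = fill_missing_ic(ic_map, all_terms)
by_count = fill_missing_ic(ic_map, all_terms, total_count=1024)
print(by_max["HP:0100378"], by_count["HP:0100378"])
```

Doing this on the TSV keeps the change out of Semsimian, at the cost of an extra step in the Jenkins job.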

caufieldjh commented 6 months ago

TODO:

caufieldjh commented 6 months ago

Also, something odd happened to the HPvMP and HPvZP uploads. Will check on that.

caufieldjh commented 6 months ago

Aha - the issue is that the IC tables for MP and ZP aren't generating what I'd expect. Running this:

runoak -g mpa.tsv -G hpoa_g2p -i sqlite:obo:phenio information-content -p i --use-associations .all > mpa_ic.tsv

This outputs only 137 ICs, none of them from MP! It looks like the subject_taxon column is being used as the calculation input instead of the object column, so the hpoa_g2p parser isn't working as expected. Other parsers are listed here: https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/parsers/__init__.py
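A quick sanity check along these lines could catch the problem before the similarity step runs: tally the CURIE prefixes in the IC table and confirm the expected ontology is present. This is only a sketch; the function name `prefix_counts` and the demo rows are invented for illustration and do not reproduce the real output.

```python
from collections import Counter


def prefix_counts(ic_lines):
    """Count CURIE prefixes in the first column of IC TSV lines."""
    counts = Counter()
    for line in ic_lines:
        term = line.split("\t", 1)[0].strip()
        if not term:
            continue  # skip blank lines
        # Full URLs such as http://purl.org/... are counted under "http".
        prefix = term.split(":", 1)[0] if ":" in term else "(no prefix)"
        counts[prefix] += 1
    return counts


# Invented rows standing in for a suspicious mpa_ic.tsv.
demo_lines = [
    "NCBITaxon:10090\t0.1\n",
    "NCBITaxon:9606\t0.2\n",
    "HP:0000118\t1.0\n",
]
counts = prefix_counts(demo_lines)
print(counts.most_common())
if counts["MP"] == 0:
    print("No MP terms in the IC table; check the association parser.")
```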