ExposuresProvider / cam-pipeline

Data loading pipeline for CAM database
https://exposuresprovider.github.io/cam-pipeline/
MIT License

Generate RO-Biolink predicate mappings based on a particular Biolink model #104

Open gaurav opened 11 months ago

gaurav commented 11 months ago

Adds scripts/generate_ro_biolink_mapping.sc, a Scala CLI script for generating a list of mappings between RDF predicates and Biolink predicates downloaded from two sources:

  1. ~The Biolink model (https://github.com/biolink/biolink-model/blob/68d4e3d7612275d0d7e832a9919bf8666e1d5fde/biolink-model.yaml)~
  2. The Biolink model's predicate mappings file (https://github.com/biolink/biolink-model/blob/68d4e3d7612275d0d7e832a9919bf8666e1d5fde/predicate_mapping.yaml)
  3. A few manual annotations from cam-kp-api PR 640.

These are written into the ro-to-biolink-predicate-mappings.tsv file (which I've included in this PR). If you want to see all the predicate mappings (not just the RO/GOREL ones), they are in ro-to-biolink-predicate-mappings-all.tsv (https://github.com/ExposuresProvider/cam-pipeline/blob/e1d6dd063c43de31ac736dbd0ce1ee57008f64fc/ro-to-biolink-predicate-mappings-all.tsv).
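To illustrate the difference between the two files, here is a minimal sketch in Scala of the kind of filter that could separate the RO/GOREL predicates from the full list. It assumes the selection is a plain CURIE-prefix check; the names `RoGorelFilter` and `isRoOrGorel` are hypothetical and are not taken from generate_ro_biolink_mapping.sc.

```scala
object RoGorelFilter {
  // Assumption: the narrower file keeps only predicates with these CURIE prefixes.
  private val KeptPrefixes = Seq("RO:", "GOREL:")

  /** True when a predicate CURIE belongs to RO or GOREL. */
  def isRoOrGorel(curie: String): Boolean =
    KeptPrefixes.exists(prefix => curie.startsWith(prefix))
}
```

With a filter like this, `RO:0002212` would be kept and `CTD:affects_splicing_of` would only appear in the "all" file.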

This file is then used by scripts/kg_edges.dl to add "qualifiers" to kg.tsv. This does seem to work currently, producing output like:

GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-9645460     infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-9645460     infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-937042      infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-937042      infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-983168      infores:go-cam
GO:0004842      biolink:regulates       GO:0004842      http://model.geneontology.org/R-HSA-983168      infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
GO:0004842      biolink:regulates       GO:0004674      http://model.geneontology.org/62b4ffe300004589  infores:go-cam
GO:0004842      biolink:regulates       GO:0004674      http://model.geneontology.org/62b4ffe300004589  infores:go-cam  {"biolink:object_direction_qualifier":"upregulated"}
[...]
GO:0022857      biolink:affects CHEBI:641       http://model.geneontology.org/5d29221b00001552  infores:go-cam  {"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0051640      biolink:affects GO:0140494      http://model.geneontology.org/5ee8120100001898  infores:go-cam  {"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0031503      biolink:affects ComplexPortal:CPX-532   http://model.geneontology.org/5df932e000000551  infores:go-cam  {"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0034504      biolink:affects MGI:MGI:3036269 http://model.geneontology.org/5df932e000003298  infores:go-cam  {"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
GO:0016197      biolink:affects GO:0005770      http://model.geneontology.org/5ee8120100000250  infores:go-cam  {"biolink:qualified_predicate":"biolink:causes"}||{"biolink:object_aspect_qualifier":"transport"}||{"biolink:object_direction_qualifier":"increased"}
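In the sample output above, the last column holds zero or more single-entry JSON objects joined by `||`. A rough Scala sketch of how a downstream consumer might read that column is shown here. It assumes every qualifier is a flat `{"key":"value"}` object with no escaped quotes, and `QualifierParsing` and `parseQualifiers` are hypothetical names, not part of this PR.

```scala
object QualifierParsing {
  // Matches one single-entry JSON object such as
  // {"biolink:object_direction_qualifier":"increased"}.
  // Assumption: keys and values contain no escaped quotes.
  private val QualifierPattern = """\{"([^"]+)":"([^"]*)"\}""".r

  /** Split a qualifiers column on "||" and collect the key -> value pairs. */
  def parseQualifiers(column: String): Map[String, String] =
    column
      .split("\\|\\|")          // "|" is a regex metacharacter, so it is escaped
      .map(_.trim)
      .collect { case QualifierPattern(key, value) => key -> value }
      .toMap
}
```

Rows without qualifiers (an empty or missing last column) yield an empty map. A real consumer would probably use a JSON library rather than a regular expression; the regex only keeps this sketch dependency-free.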

Things to do:

This PR also adds the command for generating ro-to-biolink-predicate-mappings.tsv, although at the moment it will never be run, because the predicate mappings file is already included in the GitHub repo.

WIP: will close #95 once implemented.

gaurav commented 10 months ago

@balhoff I've now added checks that (1) look for duplication between the local mappings file and the generated predicate files, and (2) look for Biolink predicates that are not present in the Biolink model. So far, I'm just printing out the PredicateMappings of concern (which are based on the predicate mappings file generated as part of the Biolink model), so unfortunately the output isn't very readable. Here's what it looks like right now, with 15 warnings:

01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ -- Found 15 mapping warnings:
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:increases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None), PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:increases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of, CTD:increases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:affects_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,Some(Set(CTD:affects_export)),None), PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:decreases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None), PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:decreases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of, CTD:decreases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps CTD:affects_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None), PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Generated predicate mapping file maps RO:0002212 to multiple Biolink terms: List(PredicateMappingRow(Some(entity negatively regulates entity),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212, RO:0002449)),None,None,None), PredicateMappingRow(Some(process negatively regulates process),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps RO:0002313 to multiple Biolink terms: List(PredicateMappingRow(None,None,None,biolink:affects,None,None,None,None,Some(Set(RO:0002313))), PredicateMappingRow(Some(increases transport of),Some(transport),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_transport_of)),None,None,Some(HashSet(RO:0002313, GAMMA:transporter, RO:0002340, GAMMA:carrier, RO:0002345))))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:increases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None), PredicateMappingRow(Some(increases secretion of),Some(secretion),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:increases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of, CTD:increases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(increases splicing of),Some(splicing),Some(increased),biolink:affects,Some(biolink:causes),Some(Set(CTD:increases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:affects_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,Some(Set(CTD:affects_export)),None), PredicateMappingRow(Some(affects secretion of),Some(secretion),None,biolink:affects,None,Some(Set(CTD:affects_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:decreases_secretion_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None), PredicateMappingRow(Some(decreases secretion of),Some(secretion),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_secretion_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:decreases_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of, CTD:decreases_RNA_splicing)),None,None,None), PredicateMappingRow(Some(decreases splicing of),Some(splicing),Some(decreased),biolink:affects,Some(biolink:causes),Some(Set(CTD:decreases_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps CTD:affects_splicing_of to multiple Biolink terms: List(PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None), PredicateMappingRow(Some(affects splicing of),Some(splicing),None,biolink:affects,None,Some(Set(CTD:affects_splicing_of)),None,None,None))
01:04:23.603 [zio-default-blocking-2] WARN generate_ro_biolink_mapping$ROBiolinkMappingsGenerator$ --  - Combined predicate mappings maps RO:0002212 to multiple Biolink terms: List(PredicateMappingRow(Some(entity negatively regulates entity),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212, RO:0002449)),None,None,None), PredicateMappingRow(Some(process negatively regulates process),None,Some(downregulated),biolink:regulates,None,Some(Set(RO:0002212)),None,None,None))
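The warnings above each name a source predicate that is covered by more than one mapping row. A minimal Scala sketch of that kind of check follows. `MappingRow` is a hypothetical, simplified stand-in for the script's `PredicateMappingRow` (whose field names are not shown in the log), and `findDuplicates` is not the script's actual function. Unlike the logged output, this sketch collapses identical rows first, so only rows that genuinely differ are reported.

```scala
object DuplicateMappingCheck {
  // Hypothetical, simplified stand-in for the script's PredicateMappingRow.
  final case class MappingRow(
    mappedPredicate: Option[String],
    biolinkPredicate: String,
    exactMatches: Set[String]
  )

  /** Return every source predicate that appears in more than one distinct row. */
  def findDuplicates(rows: Seq[MappingRow]): Map[String, Seq[MappingRow]] =
    rows.distinct                                   // drop rows that are exact copies
      .flatMap(row => row.exactMatches.toSeq.map(pred => pred -> row))
      .groupBy(_._1)                                // source predicate -> (pred, row) pairs
      .map { case (pred, pairs) => pred -> pairs.map(_._2) }
      .filter { case (_, matched) => matched.size > 1 }
}
```

Under this choice, the two differing RO:0002212 rows would still be flagged, while two byte-identical rows for a single predicate would not.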

We can ignore the CTD mappings since we currently don't export those at all.

However, it looks like the following terms are duplicated:

gaurav commented 10 months ago

I've deleted RO:0002313 from local mappings in 797ff28.

gaurav commented 8 months ago

Hi @balhoff -- just wanted to poke you to review this PR. If you need help incorporating it into the changes you've made to re-add CTD, please let me know.

gaurav commented 4 months ago

Hi @balhoff -- just wanted to poke you to review this PR. If you need help incorporating it into the changes you've made to re-add CTD, please let me know.