expanded functionality for `biolink:publications` edge-attribute

colleenXu commented 1 year ago

The Translator UI is supposed to be able to handle more kinds of "references" (publications) for an edge - not just the PMIDs we provide in the biolink:publications edge-attribute right now. In Translator Slack comms, the UI team has confirmed that they plan to support the specification here.

For now, we don't have to worry about "free-text description"-style references (we don't really have any of these).

And I'll explain the spec below...

Implementation

We'd like to adjust / expand our behavior to match this spec and provide more reference info to users. We'd do this by taking the values from (sometimes multiple) fields, adding or replacing the proper prefixes, and putting them into 1 edge-attribute.

Here's what's involved:

This is the set of response-mapping keys for fields that we want to use as input for the biolink:publications's value, and how they should be processed:
1. ref_pmid (previously pubmed): we want the output-strings to have the prefix PMID
2. ref_url (previously biolink:source_web_page): no processing needed. The strings are urls
3. ref_pmcid: we want the output-strings to have the prefix PMCID (however, I made a biolink-model issue because which prefix to use was confusing https://github.com/biolink/biolink-model/issues/1366)
4. ref_clinicaltrials: we want the output-strings to have the prefix clinicaltrials. However, the spec said putting this data in this edge-attribute was temporary / in-flux...
5. ref_doi: we want the output-strings to have the prefix doi (biolink-model spelling ref)
6. ref_isbn: we want the output-strings to have the prefix isbn (biolink-model spelling ref)

I've updated the x-bte annotation to use these special response-mapping keys here. To use these SmartAPI yamls for dev work, put the content below into your local BTE smartapi_overrides file:

SmartAPI overrides

Note: [PharmGKB](https://github.com/NCATS-Tangerine/translator-api-registry/blob/publication-keywords/pharmgkb/smartapi.yaml) is excluded here because it isn't added to the config file yet. It can be added here if you want, but the stuff listed here should be plenty to test the 6 response-mapping keys above...

```
{
  "conf": {
    "only_overrides": false
  },
  "apis": {
    "0212611d1c670f9107baf00b77f0889a": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/CTD/smartapi.yaml",
    "1f47552dabd67351d4c625adb0a10d00": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/EBIgene2phenotype/smartapi.yaml",
    "77ed27f111262d0289ed4f4071faa619": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/MGIgene2phenotype/smartapi.yaml",
    "38e9e5169a72aee3659c9ddba956790d": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/bindingdb/smartapi.yaml",
    "e3edd325c76f2992a111b43a907a4870": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/dgidb/openapi.yml",
    "316eab811fd9ef1097df98bcaa9f7361": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/gtrx/gtrx.yaml",
    "dca415f2d792976af9d642b7e73f7a41": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/litvar/smartapi.yaml",
    "8f08d1446e0bb9c2b323713ce83e2bd3": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mychem.info/openapi_full.yml",
    "671b45c0301c8624abbd26ae78449ca2": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mydisease.info/smartapi.yaml",
    "59dce17363dce279d389100834e43648": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/mygene.info/openapi_full.yml",
    "09c8782d9f4027712e65b95424adba79": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/myvariant.info/openapi_full.yml",
    "b772ebfbfa536bba37764d7fddb11d6f": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/ncats_rare_source/smartapi.yaml",
    "edeb26858bd27d0322af93e7a9e08761": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/pfocr/smartapi.yaml",
    "03283cc2b21c077be6794e1704b1d230": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/rhea/smartapi.yaml",
    "1d288b3a3caf75d541ffaae3aab386c8": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/semmeddb/smartapi.yaml",
    "d22b657426375a5295e7da8a303b9893": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/publication-keywords/biolink/openapi.yml"
  }
}
```

For each KG edge, we'll want to take these 6 response-mapping keys, process the fields' contents, and stuff those output-strings together into an array of strings that's 1 biolink:publications edge-attribute (many-to-1). It should have this format:

{
    "attribute_type_id": "biolink:publications",
    "value": [
        "PMID:1234",
        "http://hello_world.com",
        "PMCID:PMC1234",
        "clinicaltrials:NCT1234",
        "doi:1234/1234",
        "isbn:1234-1234-1"
    ],
    "value_type_id": "linkml:Uriorcurie"
}
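To make the many-to-1 assembly concrete, here is a minimal Python sketch (BTE itself isn't written in Python; this only illustrates the data flow). The helper name `build_publications_attribute` and the shape of `record` (a dict of response-mapping key to list of strings) are assumptions for illustration, and the sketch assumes the incoming IDs are bare (no existing prefixes, no nulls, no duplicates).

```python
from typing import Dict, List, Optional

# Prefix for each response-mapping key, as listed in this issue.
# None means "no processing needed" (the strings are already urls).
PREFIX_BY_KEY: Dict[str, Optional[str]] = {
    "ref_pmid": "PMID",
    "ref_url": None,
    "ref_pmcid": "PMCID",
    "ref_clinicaltrials": "clinicaltrials",
    "ref_doi": "doi",
    "ref_isbn": "isbn",
}


def build_publications_attribute(record: Dict[str, List[str]]) -> dict:
    """Hypothetical helper: merge the six ref_* fields into one edge-attribute."""
    values: List[str] = []
    for key, prefix in PREFIX_BY_KEY.items():
        for raw in record.get(key, []):
            values.append(raw if prefix is None else f"{prefix}:{raw}")
    return {
        "attribute_type_id": "biolink:publications",
        "value": values,
        # Copied from the example above; whether this is the right type
        # is still an open question in this issue.
        "value_type_id": "linkml:Uriorcurie",
    }


example = build_publications_attribute({
    "ref_pmid": ["1234"],
    "ref_url": ["http://hello_world.com"],
    "ref_pmcid": ["PMC1234"],
    "ref_clinicaltrials": ["NCT1234"],
    "ref_doi": ["1234/1234"],
    "ref_isbn": ["1234-1234-1"],
})
```

With this input, `example["value"]` holds the same six strings as the format shown above, in the same order.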

Potentially-helpful implementation notes:

* we want the code to handle when the field's value is null or an empty string (ignore, don't add to output?)
* we want the code to remove duplicates in the value array, after the array has been assembled (and after records are merged into edges...)
* sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)
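The first two notes (skip null / empty values, then de-duplicate the assembled array) could look roughly like the Python sketch below. `clean_and_dedupe` is a made-up name, and treating whitespace-only strings as empty is an assumption on top of what's written above.

```python
from typing import Iterable, List, Optional


def clean_and_dedupe(values: Iterable[Optional[str]]) -> List[str]:
    """Hypothetical helper: drop null/empty entries and duplicates, keeping first-seen order."""
    seen = set()
    cleaned: List[str] = []
    for value in values:
        if value is None:
            continue
        value = value.strip()  # assumption: whitespace-only counts as empty
        if not value or value in seen:
            continue
        seen.add(value)
        cleaned.append(value)
    return cleaned


# Values as they might look after several records were merged into one edge.
merged = clean_and_dedupe(
    ["PMID:1234", None, "", "doi:1234/1234", "PMID:1234", "  "]
)
```

Running this after the records are merged (rather than per record) is what catches duplicates that come from different records on the same edge.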

colleenXu commented 1 year ago

Issues we'll want to deal with at some point:

* I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do...
* is the value_type_id above correct? I'll need to check

Known issue, set-aside and out-of-scope for now: We won't fully follow the spec because of some limitations with the current x-bte annotation (this may improve with JQ-related processing):

The documentation says we "MUST report only one identifier per publication" and must report the CURIE (not the full url) whenever we can. But right now, we won't follow this in these cases:

* MyChem:
  * Chembl drug mechanisms: both the CURIE and the expanded url will be present for PMID / clinicaltrials / doi / PMC. This is because each reference object has a full url, but only some kinds also had the ID, and I included response-mapping keys for both CURIE stuff and url stuff...
  * Chembl drug indications: both the CURIE and the expanded url will be present for clinicaltrials. Same reasoning as above
* Bindingdb: in some cases, an edge will have a doi + a pmid that actually refer to the same publication (they're in the same `relation` object of the original API response). I annotated both fields because there are plenty of cases where relationships have only 1 of the fields
* PharmGKB (not added to config list yet): `ref_url: data.literature._sameAs` seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

Finally: the spec says a second biolink:publications edge-attribute can be made when the reference info is a free-text string. I don't think we need to do that in this issue, because I didn't notice any strong examples of these in the SmartAPI yamls...

rjawesome commented 1 year ago

> sometimes a field's value will already have a prefix (it may or may not be formatted exactly the way it should be for biolink-model) and sometimes it won't. So sometimes we'll be adding a prefix, sometimes replacing it, and sometimes we may not need to do anything (it's already formatted the way we want)

My current plan is to check if the prefix [with ":" so like "PMID:"] is there (in any casing), and if so, strip the prefix. Then just add the prefix. Are there any other cases that should be handled?
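A rough Python sketch of that plan, just to pin down the behavior (the function name `with_prefix` is made up for illustration):

```python
def with_prefix(value: str, prefix: str) -> str:
    """Strip `prefix:` from the front if present (any casing), then add the canonical prefix."""
    head, sep, tail = value.partition(":")
    if sep and head.strip().lower() == prefix.lower():
        value = tail
    return f"{prefix}:{value.strip()}"


already_ok = with_prefix("PMID:1234", "PMID")         # unchanged
wrong_case = with_prefix("pmid:1234", "PMID")         # prefix replaced
no_prefix = with_prefix("1234", "PMID")               # prefix added
isbn_case = with_prefix("ISBN:1234-1234-1", "isbn")   # prefix replaced
```

One case this sketch does not handle: a value whose existing prefix is spelled differently from the target, such as `PMC:1234` when the target prefix is `PMCID`. It would come out as `PMCID:PMC:1234`, so aliases like that would need their own handling.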

> I didn't modify biolink / monarch API but I'll need to (it still uses the pubmed response-mapping key). I think this is tricky to work with because of the special api-response-transform that happens to it. It's not clear to me if the transformed publications field holds only PMIDs or if it can sometimes have non-PMID publications the way the raw API responses do

It seems like the current code is attempting to keep only the PMID IDs and drop everything else. But if they are using the same/known prefixes for the other ID types, then we have two options:

* this transformer could be modified to put different ID types into different properties based on prefix (i.e. publicationsPMID, publicationsPMCID, etc.) which could then be referenced in the smartapi yaml using ref_pmid, ref_pmcid, etc.
* if the publication IDs from biolink are trusted to be formatted exactly as we want them, we could modify the transformer to directly take the raw API publications field and put it into the publications field of the record
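For the first option, the grouping step might look like the Python sketch below. The property names `publicationsPMID` and `publicationsPMCID` come from the suggestion above; `publicationsDOI`, the function name, and the choice to silently drop unrecognized prefixes are assumptions. The IDs in the example are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Lowercased CURIE prefix -> per-type property name (partly hypothetical, see above).
PROPERTY_BY_PREFIX = {
    "pmid": "publicationsPMID",
    "pmcid": "publicationsPMCID",
    "doi": "publicationsDOI",
}


def split_publications(publications: Iterable[str]) -> Dict[str, List[str]]:
    """Group publication CURIEs into per-type properties based on their prefix."""
    grouped = defaultdict(list)
    for pub in publications:
        prefix, sep, _ = pub.partition(":")
        prop = PROPERTY_BY_PREFIX.get(prefix.lower()) if sep else None
        if prop is not None:  # assumption: unknown prefixes are dropped
            grouped[prop].append(pub)
    return dict(grouped)


split = split_publications(
    ["PMID:1234", "WormBase:WBPaper00001", "doi:10.1/x", "PMID:5678"]
)
```

Each resulting property could then be pointed at by its own response-mapping key in the yaml, which keeps the prefix handling for each type separate.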

> PharmGKB (not added to config list yet): ref_url: data.literature._sameAs seems to be the best way to get 1 value per unique reference. However, sometimes this field's value is an expanded url for a PMID/PMCID (clinical guidelines, variant annotation). There shouldn't be any issues with "reporting the same publication more than once"

If we wanted to use the CURIEs when possible, we could parse the URL looking for the base URLs that identify a PMID/PMCID (e.g. http://www.ncbi.nlm.nih.gov/pmc/, http://www.ncbi.nlm.nih.gov/pubmed), and then translate it to a PMID/PMCID/etc

colleenXu commented 1 year ago

Remember, you can search the PR / branch for SmartAPI yamls that contain the keywords you're testing

Test for ref_isbn, ref_pmid, ref_url

* Query only MyDisease through BTE with the TRAPI query below.
* [The sub-query to MyDisease will retrieve this data](https://mydisease.info/v1/query?q=hpo.omim:%22127300%22&fields=hpo)
* `phenotype_related_to_disease` objects have diff mixes of `ref_` fields:
  * `HP:0002762` has both a pmid and a website field (representing two different references)
  * many have only the isbn field or only the website field
* `ref_` field output has prefixes on it (isbn prefix needs replacing)
* Note: `ref_isbn` is only used in the MyDisease Disease -> PhenotypicFeature operations (response-mapping [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/mydisease.info/smartapi.yaml#L631)), so that's what this test is based on.

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["OMIM:127300"],
          "categories": ["biolink:Disease"]
        },
        "n1": {
          "categories": ["biolink:PhenotypicFeature"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

Test for ref_doi, ref_clinicaltrials (and ref_pmid and ref_url)

* This test is based on MyChem's drug-mechanism operations (response-mapping [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/mychem.info/openapi_full.yml#L632) and [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/mychem.info/openapi_full.yml#L658))
* Query only MyChem through BTE with the TRAPI query below.
* [The sub-query to MyChem will retrieve this data](https://mychem.info/v1/query?q=chembl.molecule_chembl_id:CHEMBL1278146&fields=chembl)
* `drug_mechanisms` objects have diff mixes of `ref_` fields:
  * the object with target_components.uniprot `P10721` has mechanism_refs for doi, clinicaltrials, pmid, and url
  * notice that for this KP, no `ref_` field output has prefixes
* Note that there are other operations with keywords ref_doi and ref_clinicaltrials

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["CHEMBL.COMPOUND:CHEMBL1278146"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:interacts_with"]
        }
      }
    }
  }
}
```

Test for ref_pmcid (and ref_url)

* This test is based on MyChem's drug-mechanism operations (response-mapping [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/mychem.info/openapi_full.yml#L632) and [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/mychem.info/openapi_full.yml#L658))
* Query only MyChem through BTE with the TRAPI query below.
* [The sub-query to MyChem will retrieve this data](https://mychem.info/v1/query?q=chembl.molecule_chembl_id:CHEMBL1743036&fields=chembl)
* the object with target_components.uniprot `O14763` has mechanism_refs for pmc and url (the object that's `type: "Other"`)
* Note that there are other operations that use the keyword ref_pmcid

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["CHEMBL.COMPOUND:CHEMBL1743036"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:interacts_with"]
        }
      }
    }
  }
}
```

Test using bindingdb: ref_pmid, ref_url, ref_doi

* This test is based on BindingDB operations (response-mapping [here](https://github.com/NCATS-Tangerine/translator-api-registry/blob/3c71ef2262259a16f5d0fb16258f4244a9de3619/bindingdb/smartapi.yaml#L641))
* Query only BindingDB through BTE with the TRAPI query below.
* [The sub-query to BindingDB will retrieve this data](https://pending.biothings.io/bindingdb/query?q=subject.uniprot.accession:Q9NWZ3%20AND%20object.pubchem_cid:90134669)
* relation field holds 3 objects:
  * 2 have all 3 `ref_` fields (pmid, doi, url). pmid + doi refer to the same publication
  * 1 has only the url field

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["UniProtKB:Q9NWZ3"],
          "categories": ["biolink:Gene"]
        },
        "n1": {
          "ids": ["PUBCHEM.COMPOUND:90134669"],
          "categories": ["biolink:SmallMolecule"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

colleenXu commented 1 year ago

@rjawesome

On checking / stripping / replacing prefixes: I think that's fine. I was wondering if it'd be any faster to keep the prefix and not strip it out, when it's already in the correct format (exactly correct case/spelling).
On biolink/monarch. Hmm:
- for now, I've updated the branch's yaml to use ref_pmid and updated the SmartAPI overrides above to include biolink/monarch API
- I'm not sure what other kinds of references are in their output...we could ask them...or do you have an idea of how we could figure this out?
- I have a note saying there was WormBase:WBPaper. But I don't think we'd recognize this in Translator...
- I lean towards the "put into different properties" since that would give us finer control...
On converting urls to CURIEs: If this is possible, this is a GREAT idea that could help with most of the cases (convert to CURIEs, then remove duplicates). However, I think this is extra functionality (not critical).

Converting urls to CURIEs may also not be possible in all cases:

* I think cases with DOI are tricky because there are a lot of possible base URLs + sometimes the DOI ID doesn't quite match up with the url (ex: doi is 10.1158/1538-7445.AM2013-DDT02-01 vs url is http://cancerres.aacrjournals.org/content/73/8_Supplement/DDT02-01).
* MyChem drug_indications clinicaltrial IDs aren't a perfect match to their url. ex: url is https://clinicaltrials.gov/search?id=%22NCT00485888%22 and ID is NCT00485888. The problem is the trailing %22
* I dunno if we want to handle cases like http vs https or having the www. vs not

Here's some lists of base URLs that we'd want to turn into CURIE prefixes

The stuff after the url should be the ID. These are the exact base urls I've found.

Turn into `PMID:`:
* `http://europepmc.org/abstract/MED/` [example](http://europepmc.org/abstract/MED/21513885)
* `https://www.ncbi.nlm.nih.gov/pubmed/` [example](https://www.ncbi.nlm.nih.gov/pubmed/22378157)

Turn into `clinicaltrials:`
* `https://clinicaltrials.gov/ct2/show/` [example](https://clinicaltrials.gov/ct2/show/NCT00046774)

Turn into `PMCID:`
* `http://europepmc.org/articles/` [example](http://europepmc.org/articles/PMC2786766)
* `https://www.ncbi.nlm.nih.gov/pmc/articles/` [example](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3994233)

Turn into `doi:` (this may be too difficult to do since there's a lot of different possible base urls)
* `http://onlinelibrary.wiley.com/doi/` [example](http://onlinelibrary.wiley.com/doi/10.1002/ddr.20229/abstract)
* `http://www.nejm.org/doi/full/` [example](http://www.nejm.org/doi/full/10.1056/NEJMra0907219)
* `https://www.tandfonline.com/doi/abs/` [example](https://www.tandfonline.com/doi/abs/10.1179/bjms.1967.014)
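A table-driven version of this conversion could look like the Python sketch below. It uses only the exact PMID / clinicaltrials / PMCID base urls listed here and leaves DOI urls alone, since those look too varied. The names `CURIE_PREFIX_BY_BASE_URL` and `url_to_curie` are made up, and returning `None` for "no match, keep the url" is an assumption about how the caller would use it.

```python
from typing import Optional

# Exact base urls from the lists in this comment -> CURIE prefix.
CURIE_PREFIX_BY_BASE_URL = {
    "http://europepmc.org/abstract/MED/": "PMID",
    "https://www.ncbi.nlm.nih.gov/pubmed/": "PMID",
    "https://clinicaltrials.gov/ct2/show/": "clinicaltrials",
    "http://europepmc.org/articles/": "PMCID",
    "https://www.ncbi.nlm.nih.gov/pmc/articles/": "PMCID",
}


def url_to_curie(url: str) -> Optional[str]:
    """Return a CURIE if the url starts with a known base url, else None."""
    for base, prefix in CURIE_PREFIX_BY_BASE_URL.items():
        if url.startswith(base):
            local_id = url[len(base):].strip("/")
            # Only convert when what's left looks like a single ID segment.
            if local_id and "/" not in local_id:
                return f"{prefix}:{local_id}"
    return None


pmid = url_to_curie("http://europepmc.org/abstract/MED/21513885")
pmcid = url_to_curie("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3994233")
trial = url_to_curie("https://clinicaltrials.gov/ct2/show/NCT00046774")
unknown = url_to_curie("http://www.nejm.org/doi/full/10.1056/NEJMra0907219")
```

Because the match is on exact base urls, http vs https and www. vs no-www variants won't convert unless they're added to the table; the caller would keep the original url whenever `None` comes back.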

rjawesome commented 1 year ago

See PRs (description of behavior on api-response-transform PR)

colleenXu commented 1 year ago

The corresponding SmartAPI updates have been done, and the registrations have been refreshed (https://github.com/NCATS-Tangerine/translator-api-registry/pull/128). This means all instances with this code deployed (dev/ci/test) should begin working with this feature within minutes (after they pull the latest registry info).

This update was needed for this code to work properly. The code isn't backward-compatible, so the old behavior (using the pubmed keyword in response-mapping) wasn't working on the instances that had a deployment with this code.

EDIT: until the code from this issue is deployed on Prod, Prod will have wonkiness with how it handles publication info - since it doesn't have the code to process the new response-mapping keywords. Jackson has already made a post in Translator Slack (general channel) informing the consortium of this.

colleenXu commented 1 year ago

And info from Aug 9-10th from UI team (Translator slack links):

* they will support this enhanced publication info
* they probably won't support isbn. In that case, their UI will silently ignore it when it's there, so we don't need to remove it for now

tokebe commented 1 year ago

@colleenXu can this be closed as completed?

colleenXu commented 1 year ago

Yep let's close this as complete since it's been deployed.

The limitations are:

* "Converting urls to CURIEs may also not be possible in all cases", listed here
* the https PMC urls for PFOCR figure urls vs PharmGKB articles with follow-up discussion here and here. Not addressed yet.
* in some cases, the spec says to add urls to the source's entry in the sources part of the TRAPI edge, rather than putting it into the publications edge-attribute. We haven't implemented this at all.

biothings / biothings_explorer

expanded functionality for `biolink:publications` edge-attribute #677
