Closed bill-baumgartner closed 2 years ago
@bill-baumgartner This is a great start. The KGX TSV format is described here: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md
I would recommend the following changes,
Edge Column | Example value |
---|---|
subject | PR:000010159 |
edge_label | biolink:expressed_in |
object | UBERON:0004362 |
relation | RO:0002206 |
association_id | jqkjkt-AqY3weukSpZEx_FK5QtU |
association_type | biolink:GeneToExpressionSiteAssociation |
publications | PMC324396 |
provided_by | CRAFT Corpus (manual annotation) |
If you have more than one publication, you can represent them as a |-delimited string.
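A minimal sketch of how such a row could be written, assuming Python's standard `csv` module rather than the KGX tooling itself. The column subset comes from the table above; the helper names and the second publication identifier are hypothetical placeholders.

```python
import csv
import io

# Columns follow the edge table above (a subset, for brevity).
EDGE_COLUMNS = ["subject", "edge_label", "object", "relation", "publications", "provided_by"]


def edge_to_row(edge):
    """Flatten one edge dict into TSV cells, joining list values with '|'."""
    row = []
    for column in EDGE_COLUMNS:
        value = edge.get(column, "")
        if isinstance(value, (list, tuple)):
            value = "|".join(value)
        row.append(value)
    return row


def edges_to_tsv(edges):
    """Serialize edges to a tab-separated string with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(EDGE_COLUMNS)
    for edge in edges:
        writer.writerow(edge_to_row(edge))
    return buffer.getvalue()


example_edge = {
    "subject": "PR:000010159",
    "edge_label": "biolink:expressed_in",
    "object": "UBERON:0004362",
    "relation": "RO:0002206",
    # The second identifier is a made-up placeholder, only here to show the delimiter.
    "publications": ["PMC324396", "PMC0000000"],
    "provided_by": "CRAFT Corpus (manual annotation)",
}

print(edges_to_tsv([example_edge]))
```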
To represent the text spans that support this association, it would be easier to model them as a separate node, for example a biolink:InformationContentEntity.
Node Column | Example value |
---|---|
id | uuid-123-456 |
name | |
category | biolink:InformationContentEntity |
sentence | "At E9.5 ERK5 expression was seen in the first and second branchial arch, cephalic region, somites and lateral ridge along the body wall." |
subject_spans | start: 8, end: 12 |
object_spans | start: 40, end: 45 | start: 57, end: 71 |
publications | PMC324396 |
And refer to this node in the association:
Edge Column | Example value |
---|---|
subject | PR:000010159 |
edge_label | biolink:expressed_in |
object | UBERON:0004362 |
relation | RO:0002206 |
association_id | jqkjkt-AqY3weukSpZEx_FK5QtU |
association_type | biolink:GeneToExpressionSiteAssociation |
publications | PMC324396 |
provided_by | CRAFT Corpus (manual annotation) |
evidence | uuid-123-456 |
Of course, I am overloading the biolink:InformationContentEntity node, and even the evidence property.
But this is just to show a way of modeling this information.
We can create a biolink class that is specifically meant for representing text spans from NER/NLP efforts.
This is definitely worth a longer discussion, alongside our provenance discussion.
cc'ing @cmungall
@deepakunni3, thanks! This is very helpful. Closing for now as I think you've answered all of my questions at the moment. I will reopen if others come up. Also happy to contribute to the overall evidence/provenance discussion. Thanks.
@bill-baumgartner Just a small edit to what I wrote earlier.
It's good to use the id field for nodes and edges. association_id will be deprecated.
Hi @bill-baumgartner, @deepakunni3 and I have been taking a fresh look at this issue this morning.
One observation we make is that the "publications" supporting a given edge may have multiple PubMed identifiers.
This would imply the existence of more than one biolink:InformationContentEntity node. Thus, the evidence identifier cannot resolve to a unique "primary" key to the sentences, since each PubMed entry will have its own sentence mapping.
It may therefore make more sense to use both the biolink:InformationContentEntity node and the node's "publications" identifier as a "composite key" (sensu relational... I know, these are graphs!).
That having been said, do we even need a separate identifier for a biolink:InformationContentEntity node, or should we instead use some identifier globally unique to the "edge" record itself (its 'id' or its 'association id')? Here we assume data curator assigned identifiers, distinct from any internal node or edge "primary key" identifier used by the database (e.g. internal node/edge identifiers for Neo4j).
Assuming this data model, a query against a biolink:InformationContentEntity node would simply use both the associated 'edge' identifier and the set of 'publications' identifiers to retrieve a set of "evidence" citations with sentence mappings.
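To make the composite-key idea concrete, here is a small sketch that stores evidence in an in-memory dictionary keyed by (edge id, publication id). The class and function names are hypothetical, a real deployment would sit on a graph or relational store, and the example values are taken from the tables earlier in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[int, int]
EvidenceKey = Tuple[str, str]  # (edge id, publication id): the "composite key"


@dataclass(frozen=True)
class SentenceEvidence:
    sentence: str
    subject_spans: Tuple[Span, ...]
    object_spans: Tuple[Span, ...]


def fetch_evidence(
    index: Dict[EvidenceKey, SentenceEvidence],
    edge_id: str,
    publications: List[str],
) -> Dict[str, SentenceEvidence]:
    """Return the evidence found for one edge, keyed by publication id."""
    hits = {}
    for publication in publications:
        evidence = index.get((edge_id, publication))
        if evidence is not None:
            hits[publication] = evidence
    return hits


EDGE_ID = "jqkjkt-AqY3weukSpZEx_FK5QtU"
index = {
    (EDGE_ID, "PMC324396"): SentenceEvidence(
        sentence=(
            "At E9.5 ERK5 expression was seen in the first and second branchial arch, "
            "cephalic region, somites and lateral ridge along the body wall."
        ),
        subject_spans=((8, 12),),
        object_spans=((40, 45), (57, 71)),
    )
}

# "PMC0000000" is a placeholder with no evidence record, so it is simply skipped.
found = fetch_evidence(index, EDGE_ID, ["PMC324396", "PMC0000000"])
for publication, evidence in found.items():
    start, end = evidence.subject_spans[0]
    print(publication, evidence.sentence[start:end])
```

With this layout the edge record only needs its own id and its publication list; no separate evidence identifier has to be stored on the edge.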
If each biolink:InformationContentEntity node only describes one mapping against a PubMed entry sentence, should the 'publications' field in the node be singular?
If the above data model "works" for tracking evidence, is there even any need for a separate "evidence" field in the "edge" model? (BTW, I just noted that the Biolink Model slot name is 'has_evidence', not just 'evidence'.)
There are probably some additional issues relating to the general theme of "provenance". Perhaps "evidence" supporting an edge is distinct from (but somewhat joined at the hip to) provenance, if one takes "provenance" to refer to the "who, what, where, when and why" of how we identified and documented this edge.
We also probably have to consider other potential evidence types in the future. Perhaps the biolink:InformationContentEntity node could have flexible property content, but with a "mandatory" property drawn from a suitable ontology, akin to the GO "evidence code", to guide interpretation of the evidence citation.
@deepakunni3, am I forgetting anything here?
Hi @RichardBruskiewich and @deepakunni3, thanks for the followup. I think your composite key idea makes sense. The strategy I was planning on using is to splice the following fields of each evidence node into a string and then use the SHA1 hash as the evidence node identifier:
This allows for the same sentence to be used as evidence for multiple associations. It also provides a globally unique identifier for each evidence node which we plan on using in the future to collect feedback on text-mined associations. For example, we would like a user to be able to point to a specific piece of evidence as being incorrect.
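A rough sketch of the hashing step, under stated assumptions: the field list, the separator, and the function name below are hypothetical, since the actual fields are not reproduced here. The 27-character identifiers shown elsewhere in this thread are consistent with an unpadded URL-safe Base64 encoding of a 20-byte SHA-1 digest, but that encoding choice is also an assumption.

```python
import base64
import hashlib


def evidence_node_id(fields):
    """Derive a deterministic identifier from the fields of an evidence node.

    The '|' separator and the choice of fields are assumptions made for this
    sketch; they are not taken from the Text Mining Provider implementation.
    """
    joined = "|".join(fields)
    digest = hashlib.sha1(joined.encode("utf-8")).digest()  # 20 bytes
    # URL-safe Base64 of 20 bytes is 28 characters with one '=' of padding.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical field values, for illustration only.
fields = [
    "PMC324396",
    "At E9.5 ERK5 expression was seen in the first and second branchial arch, "
    "cephalic region, somites and lateral ridge along the body wall.",
    "8|12",
    "40|45,57|71",
]
print(evidence_node_id(fields))
```

Because the identifier depends only on the field values, the same sentence and spans always yield the same node id, which is what allows one evidence node to be shared by several associations.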
@RichardBruskiewich, I think this strategy aligns with your comment above, but if it does not, please let me know.
Thanks @bill-baumgartner,
Clever idea. Your points about reusing and tracking sentences (for user feedback) are well taken. Of course, the alignment of the edge semantics against a given subject/(predicate?)/object span of a given sentence in a given PMID could be globally identified using the SHA1 hash for later retrieval for user feedback use cases, etc.
In addition to the subject and object spans, might there also be a need to tag the predicate?
That said, I wonder if perhaps we might wish to more precisely nail down the specific query use cases which will need to interact with the information and assess various solutions against them? In particular, I wonder about management of potential many-to-one (let alone, many-to-many) relationships between edge (association) assertions and a given text.
What I mean is that when I visit an edge, I only have a list of PMC IDs and the edge id (i.e. whichever id @deepakunni3 says is canonical to the edge) to use in my query. One would not yet know the sentence text and subject/object spans in advance of such a query, but one expects to get back at least one sentence hit per PMC ID (or rather PMID, since not all papers cited will be in PubMed Central?), with the salient details.
Assuming the proposed SHA1 hash is used, one might expect to have to store every hash ID for every PMC ID/sentence match in the edge record (under the has_evidence property). For some edges, the list could become quite long and, given that the edge id and PMC IDs are already in the record, is this perhaps a subtle duplication of the minimal information needed to retrieve the sentence hit supporting the edge assertion? Would not the PMID/edge_id composite query key likely always give back a unique record of subject/predicate/span + sentence? I wonder whether the subject/predicate/object hit of a given edge mapping to a given PMID sentence will ever have multiple matches against a given sentence (even if other edges also match that sentence). Note that even in SemMedDB, the sentence 'hit' is cataloged in a separate table ("PREDICATION_AUX") from the sentence ("SENTENCE"), thus reusability of sentences is built into the data model.
Once again, one suspects that the SHA1 hash remains a very useful strategy to support other anticipated use cases here (e.g. a globally unique identifier to directly individually track multiple semantic alignments against a given sentence in a given citation). Whether or not it helps the basic edge-to-sentence-hit traversal use case might merit further review.
@RichardBruskiewich, yes, the predicate should also be tagged. Thank you for pointing that out. This came up in a separate conversation I had yesterday as well, and was an oversight in my initial representation.
I think your idea to enumerate query use cases is a good one. My impression is that the Translator community will be querying for entity-predicate-entity triples (or triples with wildcards, e.g. entity-predicate-?, entity-?-entity, entity-?-?, etc.) and would then query for evidence supporting the triples that matched their query. Is your use case different from this? If so, can you expand on your requirements?
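For illustration, the wildcard patterns listed above can be sketched as a filter over (subject, predicate, object) tuples, with `None` playing the role of "?". The function name and the second triple are hypothetical; this is not how any Translator component is actually implemented.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]


def match_triples(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching the pattern; None acts as the '?' wildcard."""

    def matches(value: str, pattern: Optional[str]) -> bool:
        return pattern is None or value == pattern

    return [
        triple
        for triple in triples
        if matches(triple[0], subject)
        and matches(triple[1], predicate)
        and matches(triple[2], obj)
    ]


triples = [
    ("PR:000010159", "biolink:expressed_in", "UBERON:0004362"),
    # Hypothetical second triple, only here to give the wildcards something to filter.
    ("PR:000010159", "biolink:expressed_in", "UBERON:0000000"),
]

# entity-predicate-? : every object for a given subject and predicate
print(match_triples(triples, subject="PR:000010159", predicate="biolink:expressed_in"))
# entity-?-entity : every predicate linking two given entities
print(match_triples(triples, subject="PR:000010159", obj="UBERON:0004362"))
```

Each matched triple would then feed the second step, the lookup of the evidence that supports it.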
@deepakunni3, thank you again for editing my example representation. It seems to me that the provided_by and publications fields belong in the evidence node instead of the edge. This way, the edge represents the entity-predicate-entity triple and can be supported by different kinds of evidence, each with its own provenance. Do you have any objections to this approach?
Just a quick follow up - is the publications link tied to PubMed? The examples from the KGX are all PubMed references.
There are a number of conference papers (ACM, IEEE, etc.) that don't have a PubMed ID but do apply machine learning to biological data. Could a DOI be used as well?
Also, a URL (link to the actual paper) would be nice as well, but this may not be the place to discuss KGX format changes.
@ozborn Excellent point. We do need to generalize this beyond PubMed, in two ways: 1) the way you indicate, using CURIEs other than PMIDs (e.g. DOIs), and 2) for non-textual evidence (which won't have sentence spans but, perhaps, some other indicators into a specific knowledge source).
Thanks @bill-baumgartner. I'm not too averse to your conclusions here. Here is my feedback:
On the "provided_by" field (@deepakunni3, correct me if I'm wrong): I suspect that this field is meant to tag which specific curation authority was the source of the edge (association) assertion itself. The fact that a given assertion is supported by a PubMed entry does not answer this question. That said, there is the obvious point to be made about multiple independent sources (i.e. across ARAs or KPs, or with various algorithms or original data providers) of identical "entity-predicate-entity" assertions, so moving "provided_by" from the edge itself into an evidence (or perhaps, broadly speaking, "provenance") node is sensible.
You are correct in noting that the "entity-predicate-entity" triplet generally suffices for the basic use case of ARAs resolving queries to a list of assertions. Each assertion will be "globally" unique, since each of the underlying identifiers in the triplet should be "globally identified" (unless, sensu RDF, there are "blank nodes" hidden therein). This serves the same purpose as a global edge (or association) id.
Clearly, in the second step in the workflow, "getting the evidence", such (global) edge identifiers likely suffice to return all "information nodes" which support a given (globally identified?) edge.
In this sense, the "publications" field (as embedded in the Biolink edge record) is technically redundant, perhaps only specified as a local (performance) convenience to data processing on the edge (i.e. once you have an edge in your hands with the list of publications as PMID's, your application can directly use those id's to link out to corresponding PubMed records online). Taking @ozborn's point above, using DOI's in that context may serve the same purpose in the same context.
That said, if the overwhelming standard use case is going to be dominated by the two step use case - query-to-edge, edge-to-evidence (likely alongside broad provenance annotation) - especially as evidence for specific assertions becomes prolific and heterogeneous (i.e. not just text-mining in nature, not just PMIDs,...), then I'd say moving the "publications" field (perhaps, now made singular, not plural) to the "evidence" ("provenance"?) node is a reasonable objective.
One final point: it looks like the provenance modeling dialog is on the front burner now (especially on the #reasonerapi Slack channel). Maybe two dragons need to be slain with one stone here.
I think the Edge Evidence table above has most of the things needed, I also like the idea of DOI inclusion although maybe it shouldn't be the primary key if there are non-DOI citations we need to include.
My main concern is the focus on the sentence. A lot of data in publications is in tables or spans several sentences, and I'm not sure how to reconcile this with the schema above. If you just cite the offset of a table entry (say, for binding affinity) that has some numerical value, it is hard to validate this without pulling up the paper. However, looking at the n:1 relationship between text_alignment and sentence, maybe this is not an issue, since a sentence could have multiple text_alignments.
One fix may be to dispense with start_char_idx and end_char_idx for "Sentence" entirely OR to rename them to min_start_char_idx and max_end_char_idx to make it explicit that these are just a bound for the "Sentence". Sentence could also be renamed to "Excerpt" or something else indicating that it does not have to be a proper English sentence.
I think this has most of what we need though.
Hi @ozborn,
Matt Brush's SEPIO presentation this morning in the NCATS data modeling weekly meeting represents a far more thorough and complete modelling of evidence and provenance. To some extent, the above is a quick brain dump inspired mainly by the Semantic Medline Database text mining data, without any pretense of being generic or the best way forward. I do take your comments to heart. At the same time, I'll likely take a run at aligning the above brainstorming with a SEPIO profile that meets the use cases implied here. Even more so, @cmungall has mentioned the notion of ingesting SEPIO to some degree into the Biolink Model. That seems logical. Hopefully, this might lead quickly to a solid implementation pathway for us all.
As an aside here, the above diagram has "Person-Author-Citation" which @bill-baumgartner has rightfully suggested is not within the scope of an Evidence model (although it may be linked to it). In this spirit, I notice that Chris has opened up a fresh issue #384 about publications, for discussion.
Thanks @RichardBruskiewich - I agree based on the presentation today that SEPIO is probably the way to go - but I haven't quite wrapped my head around all of it yet and I'm not sure to what degree we want to pull all of that into Biolink.
I do think that supporting non-PubMed publications and discontinuous spans, and having critical publication metadata (year, title, abstract) easily accessible without forcing the user to download the publication (if they even can), is important.
The following is a proposal for representing EPC data for text-mined Biolink associations in the KGX TSV format. This proposal is based on yesterday's presentation during the DM call by @cmungall. It attempts to follow Option 4 to represent a text-mined Biolink association that is supported by sentences in the literature.
There are several pieces of EPC data included:
Option 4:
The use case is a biolink:ChemicalToGeneAssociation linking CHEBI:3215 (bupivacaine) to PR:000031567 (LRRC3B) using the biolink:entity_negatively_regulates_entity predicate. For the purposes of this exercise we will use the following two sentences as evidence for the assertion:
PMID:29085514
PMID:12345678
id | name | category |
---|---|---|
CHEBI:3215 | bupivacaine | biolink:ChemicalEntity |
PR:000031567 | leucine-rich repeat-containing protein 3B | biolink:Protein |
subject | predicate | object | association_type | sentence_count | confidence_score | publications | _attributes |
---|---|---|---|---|---|---|---|
CHEBI:3215 | biolink:entity_negatively_regulates_entity | PR:000031567 | biolink:ChemicalToGeneAssociation | 2 | 0.9378 | PMID:29085514,PMID:12345678 | ATTRIBUTE_JSON_BLOB |
where the ATTRIBUTE_JSON_BLOB would be the following:
- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  value_url: https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  value_url: https://pubmed.ncbi.nlm.nih.gov/
  description: PubMed® comprises citations for biomedical literature from MEDLINE, life science journals, and online books.
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:has_evidence_count ## NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
  value: 2
  value_type_id: biolink:SentenceCount ## NOTE: THIS CLASS DOES NOT EXIST IN BIOLINK
  description: The count of the number of sentences that assert this edge
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000168 # confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000438 # has_supporting_evidence_from_source
  value: PMID:29085514
  value_type_id: biolink:Publication
  value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
  description: A document that has part at least one sentence that asserts the Biolink association represented by this edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for a given article as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2017
      value_type_id: UO:0000036 # year
      description: The year this document was published
      attribute_source: infores:pubmed
    - attribute_type_id: SIO:000028 # has part
      value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
      value_type_id: EDAM:data_3671 # text, or SIO:000113 'sentence'
      description: A sentence asserting the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
      attributes:
        - attribute_type_id: SIO:000028 # has part
          value: '31|42'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: SIO:000028 # has part
          value: '104|110'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: SEPIO:0000440 # has_supporting_evidence
          value: 0.99956816
          value_type_id: EDAM:data_1772 # score
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: BFO:0000050 # part_of
          value: IAO_0000315 # abstract
          value_type_id: IAO_0000314 # document part
          description: The part of the document where the sentence is located
          attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000438 # has_supporting_evidence_from_source
  value: PMID:12345678
  value_type_id: biolink:Publication
  value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
  description: A document that has part at least one sentence that asserts the Biolink association represented by this edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for a given article as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2021
      value_type_id: UO:0000036 # year
      description: The year this document was published
      attribute_source: infores:pubmed
    - attribute_type_id: SIO:000028 # has part
      value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
      value_type_id: EDAM:data_3671 # text, or SIO:000113 'sentence'
      description: A sentence asserting the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
      attributes:
        - attribute_type_id: SIO:000028 # has part
          value: '42|53'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: SIO:000028 # has part
          value: '75|81'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: SEPIO:0000440 # has_supporting_evidence
          value: 0.876
          value_type_id: EDAM:data_1772 # score
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: BFO:0000050 # part_of
          value: IAO_0000305 # document title
          value_type_id: IAO_0000314 # document part
          description: The part of the document where the sentence is located
          attribute_source: infores:text-mining-provider-targeted
@cmungall , @RichardBruskiewich , @mikebada, @mbrush - if you have a moment to review at some point I would appreciate any feedback. I think there was talk of limiting the depth of the attribute nesting to two. The representation above uses a depth of three, so perhaps this proposal is a non-starter. Thanks in advance for any comments/suggestions! - Bill
Here is the next iteration of our use case. This version complies with the TRAPI attribute constraint that limits the nesting of attributes to a single level (see this PR), i.e. an attribute can have attributes, but its attributes cannot have attributes.
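As a quick way to check a payload against that constraint, here is a minimal sketch that measures how deeply `attributes` lists are nested. It counts the edge's own attributes as depth one, so the limit described above corresponds to a maximum depth of two. The function names are hypothetical and the dictionaries are trimmed stand-ins, not complete TRAPI attributes.

```python
def attribute_depth(attributes):
    """Return the deepest nesting level: 0 for no attributes, 1 for a flat list."""
    if not attributes:
        return 0
    return 1 + max(attribute_depth(item.get("attributes") or []) for item in attributes)


def complies_with_nesting_limit(attributes, max_depth=2):
    """True when no attribute list is nested deeper than max_depth levels."""
    return attribute_depth(attributes) <= max_depth


# Publication -> sentence -> span: three levels, as in the earlier proposal.
three_levels = [
    {
        "attribute_type_id": "SEPIO:0000438",
        "value": "PMID:29085514",
        "attributes": [
            {
                "attribute_type_id": "SIO:000028",
                "value": "(sentence text)",
                "attributes": [{"attribute_type_id": "SIO:000028", "value": "31|42"}],
            }
        ],
    }
]

# Sentence -> {publication, span}: two levels, as in this iteration.
two_levels = [
    {
        "attribute_type_id": "SEPIO:0000438",
        "value": "(sentence text)",
        "attributes": [
            {"attribute_type_id": "BFO:0000050", "value": "PMID:29085514"},
            {"attribute_type_id": "SIO:000028", "value": "31|42"},
        ],
    }
]

print(attribute_depth(three_levels), complies_with_nesting_limit(three_levels))
print(attribute_depth(two_levels), complies_with_nesting_limit(two_levels))
```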
The use case is a biolink:ChemicalToGeneAssociation linking CHEBI:3215 (bupivacaine) to PR:000031567 (LRRC3B) using the biolink:entity_negatively_regulates_entity predicate. For the purposes of this exercise we will use the following two sentences as evidence for the assertion:
PMID:29085514
PMID:12345678
id | name | category |
---|---|---|
CHEBI:3215 | bupivacaine | biolink:ChemicalEntity |
PR:000031567 | leucine-rich repeat-containing protein 3B | biolink:Protein |
subject | predicate | object | id | association_type | sentence_count | confidence_score | publications | _attributes |
---|---|---|---|---|---|---|---|---|
CHEBI:3215 | biolink:entity_negatively_regulates_entity | PR:000031567 | hcR2-6QIJratLDFyFxwcSO6UW1M | biolink:ChemicalToGeneAssociation | 2 | 0.9378 | PMID:29085514,PMID:12345678 | ATTRIBUTE_JSON_BLOB |
where the ATTRIBUTE_JSON_BLOB would be the following:
- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:has_evidence_count ## NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
  value: 2
  value_type_id: biolink:EvidenceCount ## NOTE: THIS CLASS DOES NOT EXIST IN BIOLINK
  description: The count of the number of sentences that assert this edge
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000168 # confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000438 # has_supporting_evidence_from_source
  value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
  value_type_id: EDAM:data_3671 # text, or SIO:000113 'sentence'
  description: A sentence asserting the Biolink association represented by the parent edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: BFO:0000050 # part_of
      value: PMID:29085514
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2017
      value_type_id: UO:0000036 # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: BFO:0000050 # part_of
      value: IAO:0000315 # abstract
      value_type_id: IAO_0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: SEPIO:0000440 # has_supporting_evidence
      value: 0.99956816
      value_type_id: EDAM:data_1772 # score
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:has_identifier # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: HCX2k2hTBVNSoReGxxsGcL33jsg
      value_type_id: EDAM:data_2091 # EDAM:accession
      description: A unique identifier for the combination of document/sentence/assertion.
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028 # has part
      value: '31|42'
      value_type_id: biolink:SubjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028 # has part
      value: '104|110'
      value_type_id: biolink:ObjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: SEPIO:0000438 # has_supporting_evidence_from_source
  value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
  value_type_id: EDAM:data_3671 # text, or SIO:000113 'sentence'
  description: A sentence asserting the Biolink association represented by the parent edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: BFO:0000050 # part_of
      value: PMID:12345678
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2021
      value_type_id: UO:0000036 # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: BFO:0000050 # part_of
      value: IAO:0000315 # abstract
      value_type_id: IAO_0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: SEPIO:0000440 # has_supporting_evidence
      value: 0.876
      value_type_id: EDAM:data_1772 # score
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:has_identifier # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: HCX2k2hTBVNSoReGxxsGcL33jsg
      value_type_id: EDAM:data_2091 # EDAM:accession
      description: A unique identifier for the combination of document/sentence/assertion.
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028 # has part
      value: '42|53'
      value_type_id: biolink:SubjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028 # has part
      value: '75|81'
      value_type_id: biolink:ObjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
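The span encoding described in the attribute descriptions above (pipe-delimited start and end offsets, commas between discontinuous spans) can be decoded with a few lines of Python. This is only a sketch: the function names are hypothetical, and the offsets are treated as end-exclusive character positions, which is an assumption that happens to match the example values. The discontinuous value in the last call is the object span from the first example in this thread, rewritten by hand into this encoding.

```python
def parse_spans(value):
    """Parse 'start|end' or 'start|end,start|end' into a list of (start, end) tuples."""
    spans = []
    for part in value.split(","):
        start, end = part.split("|")
        spans.append((int(start), int(end)))
    return spans


def covered_text(sentence, value):
    """Return the substrings of the sentence covered by each encoded span."""
    return [sentence[start:end] for start, end in parse_spans(value)]


sentence = (
    "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer "
    "cell invasion, and suppressed LRRC3B mRNA expression in cells."
)
print(covered_text(sentence, "31|42"))    # subject span
print(covered_text(sentence, "104|110"))  # object span

# Discontinuous span, re-encoded by hand from the first example in this thread.
earlier = (
    "At E9.5 ERK5 expression was seen in the first and second branchial arch, "
    "cephalic region, somites and lateral ridge along the body wall."
)
print(covered_text(earlier, "40|45,57|71"))
```

Checking each decoded substring against the expected entity mention is a cheap way to catch off-by-one errors in the offsets before they are published.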
@nlharris, if there is time, would it be possible to discuss this issue at the next Biolink Help Desk?
Up to @sierra-moxon but should be possible! I'll tag it.
Oh wait, this is in a different repo that doesn't have all the labels we have in biolink-model. I put it on the help desk agenda.
Thanks @nlharris! @sierra-moxon I should have tagged you here originally. Sorry about that. Let me know if you think a different venue would be more appropriate to discuss this instead of the Help Desk. Thanks!
Help desk is great! thank you for doing so much work on this, and for pointing out this ticket! We'll plan on it for next Monday. :D
Hi Bill, thanks so much for this. It's a great start. I have some specific feedback and suggestions that would best be discussed on a call, but I am OOO the first half of next week. Any chance we can push it to the following week's Help Desk, or discuss on another call next week? Anytime 8/5 or after.
Hi Matt. Very interested in your feedback so happy to push it back to the following week's Help Desk assuming that works for @sierra-moxon. Thanks!
yep that works :) sounds good!
This post summarizes discussions that have occurred over the past two weeks regarding the structure of EPC metadata for results returned by the Text Mining Provider. The use case is repeated from above for completeness. Note that this post is largely a recapitulation of both the example and figure (below) composed by @mbrush.
The use case is a biolink:ChemicalToGeneAssociation linking CHEBI:3215 (bupivacaine) to PR:000031567 (LRRC3B) using the biolink:entity_negatively_regulates_entity predicate. For the purposes of this exercise we will use the following two sentences as evidence for the assertion:
PMID:29085514
PMID:12345678
id | name | category |
---|---|---|
CHEBI:3215 | bupivacaine | biolink:ChemicalEntity |
PR:000031567 | leucine-rich repeat-containing protein 3B | biolink:Protein |
subject | predicate | object | id | association_type | confidence_score | supporting_study_results | supporting_publications | _attributes |
---|---|---|---|---|---|---|---|---|
CHEBI:3215 | biolink:entity_negatively_regulates_entity | PR:000031567 | hcR2-6QIJratLDFyFxwcSO6UW1M | biolink:ChemicalToGeneAssociation | 0.9378 | tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg\|tmkp:6c9D9220faF116beFa1e80800D4 | PMID:29085514\|PMID:12345678 | ATTRIBUTE_JSON_BLOB |
where the ATTRIBUTE_JSON_BLOB would be JSON represented by the following YAML:
- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_document ## NOT CURRENTLY IN BIOLINK
  value: PMID:29085514|PMID:12345678
  value_type_id: biolink:Publication
  description: The documents that contain the sentences that assert the Biolink association represented by the parent edge
  attribute_source: infores:pubmed
- attribute_type_id: biolink:tmkp_confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_study_result ## NOT CURRENTLY IN BIOLINK
  value: tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg
  value_type_id: biolink:TextMiningResult ## NOT CURRENTLY IN BIOLINK
  description: A single result from running an NLP tool over a piece of text
  attribute_source: infores:text-mining-provider-targeted
  attributes:
    - attribute_type_id: biolink:supporting_text ## NOT CURRENTLY IN BIOLINK
      value: The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.
      value_type_id: EDAM:data_3671 # EDAM:text
      description: The text that asserts the relationship between the subject and object entity
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document ## NOT CURRENTLY IN BIOLINK
      value: PMID:29085514
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document_type ## NOT CURRENTLY IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document_year ## NOT CURRENTLY IN BIOLINK
      value: 2017
      value_type_id: UO:0000036 # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_text_located_in ## NOT CURRENTLY IN BIOLINK
      value: IAO:0000315 # abstract
      value_type_id: IAO:0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:extraction_confidence_score ## NOT CURRENTLY IN BIOLINK
      value: 0.9995681
      value_type_id: EDAM:data_1772 # EDAM:score
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:subject_location_in_text ## NOT CURRENTLY IN BIOLINK
      value: '31|42'
      value_type_id: SIO:001056 # SIO:character_position
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:object_location_in_text ## NOT CURRENTLY IN BIOLINK
      value: '104|110'
      value_type_id: SIO:001056 # SIO:character_position
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
- attribute_type_id: biolink:supporting_study_result ## NOT CURRENTLY IN BIOLINK
  value: tmkp:6c9D9220faF116beFa1e80800D4
  value_type_id: biolink:TextMiningResult ## NOT CURRENTLY IN BIOLINK
  description: A single result from running an NLP tool over a piece of text
  attribute_source: infores:text-mining-provider-targeted
  attributes:
    - attribute_type_id: biolink:supporting_text ## NOT CURRENTLY IN BIOLINK
      value: This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.
      value_type_id: EDAM:data_3671 # EDAM:text
      description: The text that asserts the relationship between the subject and object entity
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document ## NOT CURRENTLY IN BIOLINK
      value: PMID:12345678
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document_type ## NOT CURRENTLY IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_document_year ## NOT CURRENTLY IN BIOLINK
      value: 2017
      value_type_id: UO:0000036 # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:supporting_text_located_in ## NOT CURRENTLY IN BIOLINK
      value: IAO:0000315 # abstract
      value_type_id: IAO:0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:extraction_confidence_score ## NOT CURRENTLY IN BIOLINK
      value: 0.876
      value_type_id: EDAM:data_1772 # EDAM:score
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:subject_location_in_text ## NOT CURRENTLY IN BIOLINK
      value: '42|53'
      value_type_id: SIO:001056 # SIO:character_position
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:object_location_in_text ## NOT CURRENTLY IN BIOLINK
      value: '75|81'
      value_type_id: SIO:001056 # SIO:character_position
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
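To make the relationship between the edge row and the attribute blob concrete, here is a rough Python sketch that serializes one (abbreviated) nested study result into the `_attributes` column, writes the edge as a TSV row, and reads it back. It uses only the standard library; it is not KGX's own writer, and how KGX expects the JSON to be quoted inside a TSV cell is an assumption here. `join_multi` is a hypothetical helper.

```python
import csv
import io
import json


def join_multi(values):
    """Join a multivalued field with the pipe delimiter used in the edge table."""
    return "|".join(values)


# One study result, abbreviated to two of its nested sub-attributes.
study_result = {
    "attribute_type_id": "biolink:supporting_study_result",
    "value": "tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg",
    "value_type_id": "biolink:TextMiningResult",
    "attribute_source": "infores:text-mining-provider-targeted",
    "attributes": [
        {
            "attribute_type_id": "biolink:supporting_document",
            "value": "PMID:29085514",
            "value_type_id": "biolink:Publication",
            "attribute_source": "infores:pubmed",
        },
        {
            "attribute_type_id": "biolink:subject_location_in_text",
            "value": "31|42",
            "value_type_id": "SIO:001056",
            "attribute_source": "infores:text-mining-provider-targeted",
        },
    ],
}

COLUMNS = [
    "subject", "predicate", "object", "id", "association_type",
    "confidence_score", "supporting_study_results",
    "supporting_publications", "_attributes",
]

row = {
    "subject": "CHEBI:3215",
    "predicate": "biolink:entity_negatively_regulates_entity",
    "object": "PR:000031567",
    "id": "hcR2-6QIJratLDFyFxwcSO6UW1M",
    "association_type": "biolink:ChemicalToGeneAssociation",
    "confidence_score": "0.9378",
    "supporting_study_results": join_multi(
        ["tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg", "tmkp:6c9D9220faF116beFa1e80800D4"]
    ),
    "supporting_publications": join_multi(["PMID:29085514", "PMID:12345678"]),
    # The nested structure travels as a single JSON string in one cell.
    "_attributes": json.dumps([study_result]),
}

# Write to an in-memory buffer; the csv module quotes the JSON cell as needed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(row)

# Read the row back and recover the nested attributes.
parsed = next(csv.DictReader(io.StringIO(buffer.getvalue()), delimiter="\t"))
attributes = json.loads(parsed["_attributes"])
nested_documents = [
    sub["value"]
    for attribute in attributes
    for sub in attribute.get("attributes", [])
    if sub["attribute_type_id"] == "biolink:supporting_document"
]
print(parsed["supporting_publications"].split("|"))  # ['PMID:29085514', 'PMID:12345678']
print(nested_documents)  # ['PMID:29085514']
```

The last two lines hint at a consistency check a consumer could run: every document referenced by a nested study result should also appear in the edge-level `supporting_publications` column.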
Thanks, @bill-baumgartner, for the great documentation of the attribute semantics for text mining EPC.
As indicated in one of the previous team calls, I would simply add a biolink:predicate_text_location as an optional attribute (hence a Biolink attribute_type_id value). I understand that your text mining does not bother with predicate tagging in sentences; however, I believe that SemMedDB provides such predicate phrase mapping, so it would be a useful extra (albeit optional) field for the model, which resources like SemMedDB could fill.
What's the status of this?
Hi @nlharris, thanks for the prompt. I'll go ahead and close this issue as it is resolved.
Hi,
I would like to represent Biolink associations using the KGX CSV/TSV file format. Is the CSV/TSV format specified somewhere? Apologies if I missed it in the documentation. Based on examples I see in the KGX unit tests I am wondering if the following columns would be appropriate to represent a GeneToExpressionSiteAssociation, for example?
Ideally I would like to include some further metadata in the publication, specifically related to the sentence from which the association was mined. Would it be appropriate to add further fields for the publication? For example:
[{"id": "PMC324396", "name": "?????", "category": "biolink:Publication", "sentence": "At E9.5 ERK5 expression was seen in the first and second branchial arch, cephalic region, somites and lateral ridge along the body wall.", "subject_spans": [{"start": 8, "end": 12}], "object_spans": [{"start": 40, "end": 45}, {"start": 57, "end": 71}]}]
Or can you recommend a different way to add the sentence and span information?
Thanks for any advice you can provide!
Best,
Bill
Edit: added by Sierra during issue triage.