How to represent sentence metadata as evidence for text-mined Biolink association using the KGX TSV format

biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs

https://kgx.readthedocs.io

BSD 3-Clause "New" or "Revised" License

How to represent sentence metadata as evidence for text-mined Biolink association using the KGX TSV format #174

Closed bill-baumgartner closed 2 years ago

bill-baumgartner commented 4 years ago

Hi,

I would like to represent Biolink associations using the KGX CSV/TSV file format. Is the CSV/TSV format specified somewhere? Apologies if I missed it in the documentation. Based on examples I see in the KGX unit tests I am wondering if the following columns would be appropriate to represent a GeneToExpressionSiteAssociation, for example?

Column	Example value
gene_to_expression_site_association_subject	PR:000010159
gene_to_expression_site_association_relation	biolink:expressed_in
gene_to_expression_site_association_object	UBERON:0004362
association_id	jqkjkt-AqY3weukSpZEx_FK5QtU
association_type	biolink:GeneToExpressionSiteAssociation
publications	[{id: "PMC324396", name:"this is required but I'm not sure what would go here", category:"biolink:Publication"}]
provided_by	CRAFT Corpus (manual annotation)

Ideally I would like to include some further metadata in the publication, specifically related to the sentence from which the association was mined. Would it be appropriate to add further fields for the publication? For example:

[{id: "PMC324396", name:"?????", category:"biolink:Publication", sentence: "At E9.5 ERK5 expression was seen in the first and second branchial arch, cephalic region, somites and lateral ridge along the body wall.", subject_spans: [{start: 8, end: 12}], object_spans:[{start: 40, end: 45}, {start: 57, end: 71}]}]

Or can you recommend a different way to add the sentence and span information?

Thanks for any advice you can provide!

Best,

Bill

Edit: added by Sierra during issue triage.

[ ] document format #288
[ ] add support to TSV sink to include JSON Blob
[ ] add support to JSON lines output to include nested attributes.
[ ] resolve provided_by refactoring in Biolink-Model

deepakunni3 commented 4 years ago

@bill-baumgartner This is a great start. The KGX TSV format is described here: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md

I would recommend the following changes:

Edge Column	Example value
subject	PR:000010159
edge_label	biolink:expressed_in
object	UBERON:0004362
relation	RO:0002206
association_id	jqkjkt-AqY3weukSpZEx_FK5QtU
association_type	biolink:GeneToExpressionSiteAssociation
publications	PMC324396
provided_by	CRAFT Corpus (manual annotation)

If you have more than one publication then you can represent that as a | delimited string.
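
As a rough illustration of the pipe-delimited convention, here is a minimal sketch that writes such an edge row with Python's standard `csv` module. It is not KGX's own writer; the helper name `edge_row` and the second publication ID (`PMC0000000`) are made up for the example.

```python
import csv
import io

# Column order taken from the edge table above.
EDGE_COLUMNS = [
    "subject", "edge_label", "object", "relation",
    "association_id", "association_type", "publications", "provided_by",
]

def edge_row(record):
    """Flatten one edge dict into a TSV row; list values become '|'-delimited strings."""
    row = []
    for column in EDGE_COLUMNS:
        value = record.get(column, "")
        if isinstance(value, (list, tuple)):
            value = "|".join(value)
        row.append(value)
    return row

edge = {
    "subject": "PR:000010159",
    "edge_label": "biolink:expressed_in",
    "object": "UBERON:0004362",
    "relation": "RO:0002206",
    "association_id": "jqkjkt-AqY3weukSpZEx_FK5QtU",
    "association_type": "biolink:GeneToExpressionSiteAssociation",
    # PMC0000000 is a hypothetical second publication, for illustration only.
    "publications": ["PMC324396", "PMC0000000"],
    "provided_by": "CRAFT Corpus (manual annotation)",
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(EDGE_COLUMNS)
writer.writerow(edge_row(edge))
tsv_text = buffer.getvalue()
print(tsv_text)
```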

To represent the text span that supports this association, it would be easier to model it as a separate node, for example a biolink:InformationContentEntity.

Node Column	Example value
id	uuid-123-456
name
category	biolink:InformationContentEntity
sentence	"At E9.5 ERK5 expression was seen in the first and second branchial arch, cephalic region, somites and lateral ridge along the body wall."
subject_spans	start: 8, end: 12
object_spans	start: 40, end: 45 | start: 57, end: 71
publications	PMC324396

And refer to this node in the association:

Edge Column	Example value
subject	PR:000010159
edge_label	biolink:expressed_in
object	UBERON:0004362
relation	RO:0002206
association_id	jqkjkt-AqY3weukSpZEx_FK5QtU
association_type	biolink:GeneToExpressionSiteAssociation
publications	PMC324396
provided_by	CRAFT Corpus (manual annotation)
evidence	uuid-123-456

Of course, I am overloading the biolink:InformationContentEntity node, and even the evidence property. But this is just to show a way of modeling this information.
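
To make the node/edge linkage concrete, here is a small sketch that holds the evidence node and the edge as plain Python dicts and resolves the edge's `evidence` value to the node. The helper names (`evidence_for`, `span_texts`) are hypothetical, and the spans are assumed to be half-open `[start, end)` character offsets into the sentence, which is what the example values appear to be.

```python
SENTENCE = (
    "At E9.5 ERK5 expression was seen in the first and second branchial arch, "
    "cephalic region, somites and lateral ridge along the body wall."
)

# Evidence nodes keyed by their id, mirroring the node table above.
nodes = {
    "uuid-123-456": {
        "id": "uuid-123-456",
        "category": "biolink:InformationContentEntity",
        "sentence": SENTENCE,
        "subject_spans": [(8, 12)],
        "object_spans": [(40, 45), (57, 71)],
        "publications": ["PMC324396"],
    }
}

# The edge refers to the evidence node by id only.
edge = {
    "subject": "PR:000010159",
    "edge_label": "biolink:expressed_in",
    "object": "UBERON:0004362",
    "evidence": ["uuid-123-456"],
}

def evidence_for(edge, nodes):
    """Return the evidence nodes that an edge points to (unknown ids are skipped)."""
    return [nodes[node_id] for node_id in edge.get("evidence", []) if node_id in nodes]

def span_texts(sentence, spans):
    """Slice the sentence with half-open (start, end) offsets."""
    return [sentence[start:end] for start, end in spans]

for node in evidence_for(edge, nodes):
    print(span_texts(node["sentence"], node["subject_spans"]))
    print(span_texts(node["sentence"], node["object_spans"]))
```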

We can create a biolink class that is specifically meant for representing text spans from NER/NLP efforts.

This is definitely worth a longer discussion, alongside our provenance discussion.

cc'ing @cmungall

bill-baumgartner commented 4 years ago

@deepakunni3, thanks! This is very helpful. Closing for now as I think you've answered all of my questions at the moment. I will reopen if others come up. Also happy to contribute to the overall evidence/provenance discussion. Thanks.

deepakunni3 commented 4 years ago

@bill-baumgartner Just a small edit to what I wrote earlier. It's better to use the id field for both nodes and edges; association_id will be deprecated.

RichardBruskiewich commented 4 years ago

Hi @bill-baumgartner, @deepakunni3 and I have been taking a fresh look at this issue this morning.

One observation we make is that the "publications" supporting a given edge may have multiple PubMed identifiers.

This would imply the existence of more than one biolink:InformationContentEntity node. Thus, the evidence identifier cannot resolve to a unique "primary" key to the sentences, since each PubMed will have its own sentence mapping.

It may therefore make more sense to use both the biolink:InformationContentEntity node and the node's "publications" identifier as a "composite key" (sensu relational... I know, these are graphs!).

That having been said, do we even need a separate identifier for a biolink:InformationContentEntity node or should we rather instead use some identifier globally unique to the "edge" record itself (its 'id' or its 'association id')? Here we assume data curator assigned identifiers, distinct from any internal node or edge "primary key" identifier used by the database (e.g. internal node/edge identifiers for Neo4j).

Assuming this data model, a query against biolink:InformationContentEntity node would simply use both the associated 'edge' identifier and the set of 'publications' identifiers, to retrieve a set of "evidence" citations with sentence mappings.

If each biolink:InformationContentEntity node only describes one mapping against a PubMed entry sentence, should the 'publications' field in the node be singular?

If the above data model "works" for tracking evidence, is there even any need for a separate "evidence" field in the "edge" model? (BTW, I just noted that the Biolink Model slot name is 'has_evidence', not just 'evidence'.)

There are probably some additional issues relating to the general theme of "provenance". Perhaps "evidence" supporting an edge is distinct from (but somewhat linked at the hips to) provenance, if one takes "provenance" to mean the "who, what, where, when and why" of how we identified and documented this edge.

We also probably have to consider other potential evidence types in the future. Perhaps the biolink:InformationContentEntity node could have flexible property content, but perhaps a "mandatory" property against a suitable ontology akin to the GO "evidence code" to guide interpretation for the evidence citation.

@deepakunni3, am I forgetting anything here?

bill-baumgartner commented 4 years ago

Hi @RichardBruskiewich and @deepakunni3, thanks for the followup. I think your composite key idea makes sense. The strategy I was planning on using is to splice the following fields of each evidence node into a string and then use the SHA1 hash as the evidence node identifier:

PMC ID
sentence text
subject span(s)
object span(s)
id for the association node (this id is a SHA1 hash of the fields in the association node)

This allows for the same sentence to be used as evidence for multiple associations. It also provides a globally unique identifier for each evidence node which we plan on using in the future to collect feedback on text-mined associations. For example, we would like a user to be able to point to a specific piece of evidence as being incorrect.
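
A minimal sketch of this hashing strategy, using Python's `hashlib`. The exact recipe is not stated in this thread, so the field order, the separator character, and the unpadded URL-safe base64 encoding of the digest are all assumptions (the encoding is a guess based on the 27-character IDs shown earlier); `sha1_id` is a hypothetical helper name.

```python
import base64
import hashlib

def sha1_id(*fields):
    """Join the fields and return an unpadded, URL-safe base64 SHA-1 digest."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    joined = "\x1f".join(str(field) for field in fields)
    digest = hashlib.sha1(joined.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

# Association id: a hash of the fields of the association itself.
association_id = sha1_id("PR:000010159", "biolink:expressed_in", "UBERON:0004362")

# Evidence id: document, sentence, spans, and the association id.
sentence = (
    "At E9.5 ERK5 expression was seen in the first and second branchial arch, "
    "cephalic region, somites and lateral ridge along the body wall."
)
evidence_id = sha1_id(
    "PMC324396",
    sentence,
    [(8, 12)],              # subject span(s)
    [(40, 45), (57, 71)],   # object span(s)
    association_id,
)
print(association_id, evidence_id)
```

Because the association id is part of the hashed input, the same sentence used as evidence for a different association gets a different evidence id, while re-running the pipeline on identical input reproduces the same id.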

@RichardBruskiewich, I think this strategy aligns with your comment above, but if it does not, please let me know.

RichardBruskiewich commented 4 years ago

Thanks @bill-baumgartner,

Clever idea. Your points about reusing and tracking sentences (for user feedback) are well taken. Of course, the alignment of the edge semantics against a given subject/(predicate?)/object span of a given sentence in a given PMID could be globally identified using the SHA1 hash for later retrieval for user feedback use cases, etc.

In addition to the subject and object spans, might there also be a need to tag the predicate?

That said, I wonder if perhaps we might wish to more precisely nail down the specific query use cases which will need to interact with the information and assess various solutions against them? In particular, I wonder about management of potential many-to-one (let alone, many-to-many) relationships between edge (association) assertions and a given text.

What I mean is that when I visit an edge, I only have a list of PMC IDs and the edge id (i.e. whichever id @deepakunni3 says is canonical to the edge) to use in my query. One would not yet know the sentence text and subject/object spans in advance of such a query, but one expects to get back at least one sentence hit per PMC ID (or rather PMID, since not all papers cited will be in PubMed Central?), with the salient details.

Assuming the proposed SHA1 hash is used, one might expect to have to store every hash ID for every PMC ID/sentence match in the edge record (under the has_evidence property). For some edges, the list could become quite long and, given that the edge id and PMC IDs are already in the record, it would perhaps subtly duplicate the minimal information needed to retrieve the sentence hit supporting the edge assertion. Would not the PMID/edge_id composite query key likely always give back a unique record of subject/predicate/object spans + sentence? I wonder whether the subject/predicate/object hit of a given edge mapping to a given PMID sentence will ever have multiple matches against a given sentence (even if other edges also match that sentence). Note that even in SemMedDB, the sentence 'hit' is cataloged in a separate table ("PREDICATION_AUX") from the sentence ("SENTENCE"), thus reusability of sentences is built into the data model.

Once again, one suspects that the SHA1 hash remains a very useful strategy to support other anticipated use cases here (e.g. a globally unique identifier to directly individually track multiple semantic alignments against a given sentence in a given citation). Whether or not it helps the basic edge-to-sentence-hit traversal use case might merit further review.
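
The edge-to-sentence-hit traversal discussed here can be sketched as a lookup on the composite key `(edge id, publication id)`. Everything below is illustrative: the record layout, the ids (`edge-A`, `PMID:111`, `s1`, ...) and the function name `evidence_for_edge` are invented, and a real store would be a database rather than an in-memory dict.

```python
from collections import defaultdict

# One record per alignment of an edge against a sentence in a publication.
# The sentence itself is stored once and referenced by id, so it can be reused.
sentences = {"s1": "Sentence text one.", "s2": "Sentence text two."}
sentence_hits = [
    {"edge_id": "edge-A", "publication": "PMID:111", "sentence_id": "s1"},
    {"edge_id": "edge-A", "publication": "PMID:222", "sentence_id": "s2"},
    {"edge_id": "edge-B", "publication": "PMID:111", "sentence_id": "s1"},
]

# Index the hits by the composite key.
index = defaultdict(list)
for hit in sentence_hits:
    index[(hit["edge_id"], hit["publication"])].append(hit)

def evidence_for_edge(edge_id, publications):
    """Collect sentence hits for an edge using only its id and its publications list."""
    hits = []
    for publication in publications:
        hits.extend(index.get((edge_id, publication), []))
    return hits

for hit in evidence_for_edge("edge-A", ["PMID:111", "PMID:222"]):
    print(hit["publication"], sentences[hit["sentence_id"]])
```

With this layout the edge record needs no list of evidence ids, at the cost of a lookup that may return more than one hit per key if an edge matches several sentences in the same publication.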

bill-baumgartner commented 4 years ago

@RichardBruskiewich, yes, the predicate should also be tagged. Thank you for pointing that out. This came up in a separate conversation I had yesterday as well, and was an oversight in my initial representation.

I think your idea to enumerate query use cases is a good one. My impression is that the Translator community will be querying for entity-predicate-entity triples (or triples with wildcards, e.g. entity-predicate-?, entity-?-entity, entity-?-?, etc.) and would then query for evidence supporting the triples that matched their query. Is your use case different from this? If so, can you expand on your requirements?
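
The wildcard query patterns mentioned here can be sketched in a few lines, treating `None` as the `?` wildcard. This is only an in-memory illustration, not how any Translator component is implemented; `match_triples` is a hypothetical name and the second triple is invented.

```python
def match_triples(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard ('?')."""
    pattern = (subject, predicate, obj)
    return [
        triple for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]

triples = [
    ("PR:000010159", "biolink:expressed_in", "UBERON:0004362"),
    # Invented second triple, for illustration only.
    ("PR:000010159", "biolink:related_to", "UBERON:0000000"),
]

# entity-predicate-?
print(match_triples(triples, subject="PR:000010159", predicate="biolink:expressed_in"))
# entity-?-?
print(match_triples(triples, subject="PR:000010159"))
```

Each matched triple would then be the starting point for the second query, the one that fetches its supporting evidence.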

@deepakunni3, thank you again for editing my example representation. It seems to me that the provided_by and publications fields belong in the evidence node instead of the edge. This way, the edge represents the entity-predicate-entity triple and can be supported by different kinds of evidence, each with its own provenance. Do you have any objections to this approach?

ozborn commented 4 years ago

Just a quick follow up - is the publications link tied to PubMed? The examples from the KGX are all PubMed references.

There are a number of conference papers (ACM, IEEE, etc.) that don't have a PubMed ID but do apply machine learning to biological data. Could a DOI be used as well?

Also, a URL (link to the actual paper) would be nice as well, but this may not be the place to discuss KGX format changes.

RichardBruskiewich commented 4 years ago

@ozborn Excellent point. We do need to generalize this beyond PubMed, in two ways: 1) the way you indicate, using CURIEs (e.g. including DOIs) which are not PMIDs; 2) for non-textual evidence (which won't have sentence spans but, perhaps, some other indicators into a specific knowledge source).
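
As a sketch of what CURIE-based publication references might look like in practice, the snippet below maps a CURIE prefix to a link-out URL. The prefix table is an assumption covering only PMID and DOI, `curie_to_url` is a hypothetical helper, and the DOI value is a placeholder rather than a real article.

```python
# Assumed prefix-to-URL templates; extend as more identifier types are agreed on.
URL_TEMPLATES = {
    "PMID": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "DOI": "https://doi.org/{}",
}

def curie_to_url(curie):
    """Expand a 'PREFIX:local_id' CURIE to a URL, or return None for unknown prefixes."""
    # Split on the first colon only, since DOI suffixes may themselves contain colons.
    prefix, separator, local_id = curie.partition(":")
    if not separator or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    template = URL_TEMPLATES.get(prefix.upper())
    return template.format(local_id) if template else None

print(curie_to_url("PMID:29085514"))
print(curie_to_url("DOI:10.1000/example"))  # placeholder DOI
```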

RichardBruskiewich commented 4 years ago

> I think your idea to enumerate query use cases is a good one. My impression is that the Translator community will be querying for entity-predicate-entity triples (or triples with wildcards, e.g. entity-predicate-?, entity-?-entity, entity-?-?, etc.) and would then query for evidence supporting the triples that matched their query. Is your use case different from this? If so, can you expand on your requirements?
>
> @deepakunni3, thank you again for editing my example representation. It seems to me that the provided_by and publications fields belong in the evidence node instead of the edge. This way, the edge represents the entity-predicate-entity triple and can be supported by different kinds of evidence, each with its own provenance. Do you have any objections to this approach?

Thanks @bill-baumgartner. I'm not too averse to your conclusions here. Here is my feedback:

- On the "provided_by" field (@deepakunni3, correct me if I'm wrong): I suspect that this field is meant to tag which specific curation authority was the source of the edge (association) assertion itself. The fact that a given assertion is supported by a PubMed entry does not answer this question. That said, there is the obvious point to be made about multiple independent sources (i.e. across ARAs or KPs, or with various algorithms or original data providers) of identical "entity-predicate-entity" assertions; thus moving "provided_by" from the edge itself into an evidence (or perhaps, broadly speaking, "provenance") node is sensible.
- You are correct in noting that the "entity-predicate-entity" triplet generally suffices for the basic use case of ARAs resolving queries to a list of assertions. Each assertion will have a "global" uniqueness to it, since each of the underlying identifiers of the triplet should be "globally identified" (unless, sensu RDF, there are "blank nodes" hidden therein). This serves the same purpose as a global edge (or association) id.

Clearly, in the second step in the workflow of "getting the evidence", such (global) edge identifiers likely suffice to return all "information nodes" which support a given (globally identified?) edge.

In this sense, the "publications" field (as embedded in the Biolink edge record) is technically redundant, perhaps only specified as a local (performance) convenience to data processing on the edge (i.e. once you have an edge in your hands with the list of publications as PMID's, your application can directly use those id's to link out to corresponding PubMed records online). Taking @ozborn's point above, using DOI's in that context may serve the same purpose in the same context.

That said, if the overwhelming standard use case is going to be dominated by the two step use case - query-to-edge, edge-to-evidence (likely alongside broad provenance annotation) - especially as evidence for specific assertions becomes prolific and heterogeneous (i.e. not just text-mining in nature, not just PMIDs,...), then I'd say moving the "publications" field (perhaps, now made singular, not plural) to the "evidence" ("provenance"?) node is a reasonable objective.

One final point: it looks like the provenance modeling dialog is on the front burner now (especially on the #reasonerapi Slack channel). Maybe two dragons need to be slain with one stone here.

RichardBruskiewich commented 4 years ago

[Attachment: Random Notes about the Evidence Model.pdf]
[Image: Edge Evidence]

ozborn commented 4 years ago

I think the Edge Evidence table above has most of the things needed. I also like the idea of DOI inclusion, although maybe it shouldn't be the primary key if there are non-DOI citations we need to include.

My main concern is the focus on the sentence. A lot of data in publications is in tables or spans sentences; I'm not sure how to reconcile this with the schema above. If you just cite the offset of a table entry (for, say, binding affinity) that has some numerical value, it is hard to validate this without pulling up the paper. However, looking at the n:1 relationship between text_alignment and sentence, maybe this is not an issue, since a sentence could have multiple text_alignments.

One fix may be to dispense with start_char_idx and end_char_idx for "Sentence" entirely OR to rename them to min_start_char_idx and max_end_char_idx to make it explicit that these are just a bound for the "Sentence". Sentence could also be renamed to "Excerpt" or something else indicating that it does not have to be a proper English sentence.
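
A small sketch of the "bounds" idea: given the character spans of all text alignments attached to an excerpt, the excerpt-level indices are simply the smallest start and the largest end. The dict layout and the function name `excerpt_bounds` are assumptions for illustration; only the field names `min_start_char_idx` and `max_end_char_idx` come from the suggestion above.

```python
def excerpt_bounds(text_alignments):
    """Compute bounding offsets over all (start, end) spans of an excerpt's alignments."""
    spans = [span for alignment in text_alignments.values() for span in alignment]
    if not spans:
        raise ValueError("an excerpt needs at least one aligned span")
    return {
        "min_start_char_idx": min(start for start, _ in spans),
        "max_end_char_idx": max(end for _, end in spans),
    }

# Spans from the ERK5 example earlier in this thread; the object is discontinuous.
alignments = {
    "subject": [(8, 12)],
    "object": [(40, 45), (57, 71)],
}
print(excerpt_bounds(alignments))
```

The bounds say nothing about what lies between the spans, so a discontinuous mention or a table cell would still need its individual spans stored on the alignment itself.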

I think this has most of what we need though.

RichardBruskiewich commented 4 years ago

Hi @ozborn,

Matt Brush's SEPIO presentation this morning in the NCATS data modeling weekly meeting represents a far more thorough and complete modelling of evidence and provenance. To some extent, the above is a quick brain dump inspired mainly by the Semantic Medline Database text mining data, without any pretense of being generic or the best way forward. I do take your comments to heart. At the same time, I'll likely take a run at aligning the above brainstorming with a SEPIO profile that meets the use cases implied here. Even more so, @cmungall has mentioned the notion of ingesting SEPIO to some degree into the Biolink Model. That seems logical. Hopefully, this might lead quickly to a solid implementation pathway for us all.

RichardBruskiewich commented 4 years ago

As an aside here, the above diagram has "Person-Author-Citation" which @bill-baumgartner has rightfully suggested is not within the scope of an Evidence model (although it may be linked to it). In this spirit, I notice that Chris has opened up a fresh issue #384 about publications, for discussion.

ozborn commented 4 years ago

Thanks @RichardBruskiewich - I agree based on the presentation today that SEPIO is probably the way to go - but I haven't quite wrapped my head around all of it yet and I'm not sure to what degree we want to pull all of that into Biolink.

I do think it is important to support non-PubMed publications and discontinuous spans, and to have critical publication metadata (year, title, abstract) easily accessible without forcing the user to download the publication (if they even can).

bill-baumgartner commented 3 years ago

The following is a proposal for representing EPC data for text-mined Biolink associations in the KGX TSV format. This proposal is based on yesterday's presentation during the DM call by @cmungall. It attempts to follow Option 4 to represent a text-mined Biolink association that is supported by sentences in the literature.

There are several pieces of EPC data included:

provenance information declaring the edge is asserted by the Text Mining Provider Targeted Association KP
a supporting data source declaration referencing PubMed
a count of the number of sentences that assert this edge
an aggregate confidence score for the edge (computed based on all of the sentences that assert the edge)
each document that includes at least one sentence that asserts the edge
- the publication year for the document
- the publication type for the document, e.g. journal article, review article, etc.
- each sentence asserting an edge from the document
- character offsets relative to the sentence defining where the subject and object concept mentions are located
- the confidence score for this particular sentence as provided by the underlying classification algorithm
- the zone within the document, e.g. abstract, introduction, conclusion, etc. where the sentence is located
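
To illustrate the character offsets in the list above, here is a sketch that parses span strings and checks them against a sentence. It assumes the encoding used for span values in this thread (`start|end`, with commas between discontinuous spans) and half-open offsets; `parse_spans` and `covered_text` are hypothetical helper names.

```python
def parse_spans(value):
    """Parse 'start|end' or 'start|end,start|end' into a list of (start, end) tuples."""
    spans = []
    for part in value.split(","):
        start, end = part.split("|")
        spans.append((int(start), int(end)))
    return spans

def covered_text(sentence, value):
    """Return the text covered by the spans, joining discontinuous pieces with a space."""
    return " ".join(sentence[start:end] for start, end in parse_spans(value))

# \u00b5 is the micro sign in "50 µg/ml"; it counts as a single character.
sentence = (
    "The administration of 50 \u00b5g/ml bupivacaine promoted maximum breast cancer "
    "cell invasion, and suppressed LRRC3B mRNA expression in cells."
)

print(covered_text(sentence, "31|42"))    # subject mention
print(covered_text(sentence, "104|110"))  # object mention
```

A check like this is cheap to run over every evidence record and catches offsets that were computed against a differently normalized copy of the sentence.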

For reference, here is @cmungall's slide describing `Option 4`:

Use case

The use case is a biolink:ChemicalToGeneAssociation linking CHEBI:3215 (bupivacaine) to PR:000031567 (LRRC3B) using the biolink:entity_negatively_regulates_entity predicate. For the purposes of this exercise we will use the following two sentences as evidence for the assertion:

"The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells." PMID:29085514
"This is a second sentence indicating that bupivacaine negatively regulates LRRC3B." PMID:12345678

Proposed Node TSV

id	name	category
CHEBI:3215	bupivacaine	biolink:ChemicalEntity
PR:000031567	leucine-rich repeat-containing protein 3B	biolink:Protein

Proposed Edge TSV (Note: scroll table to see all columns)

subject	predicate	object	association_type	sentence_count	confidence_score	publications	_attributes
CHEBI:3215	biolink:entity_negatively_regulates_entity	PR:000031567	biolink:ChemicalToGeneAssociation	2	0.9378	PMID:29085514,PMID:12345678	`ATTRIBUTE_JSON_BLOB`

where the ATTRIBUTE_JSON_BLOB would be the following:

- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  value_url: https://api.bte.ncats.io/v1/smartapi/978fe380a147a8641caf72320862697b/query/
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  value_url: https://pubmed.ncbi.nlm.nih.gov/
  description: PubMed® comprises citations for biomedical literature from MEDLINE, life science journals, and online books.
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:has_evidence_count ## NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
  value: 2
  value_type_id: biolink:SentenceCount ## NOTE: THIS CLASS DOES NOT EXIST IN BIOLINK
  description: The count of the number of sentences that assert this edge
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: SEPIO:0000168  # confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
  value: PMID:29085514
  value_type_id: biolink:Publication
  value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
  description: A document that has part at least one sentence that asserts the Biolink association represented by this edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for a given article as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2017
      value_type_id: UO:0000036  # year
      description: The year this document was published
      attribute_source: infores:pubmed 
    - attribute_type_id: SIO:000028  # has part
      value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
      value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'  
      description: A sentence asserting the Biolink association represented by the parent edge     
      attribute_source: infores:pubmed
      attributes:
        - attribute_type_id: SIO:000028  # has part
          value: '31|42'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge.
          attribute_source:  infores:text-mining-provider-targeted
        - attribute_type_id: SIO:000028  # has part
          value: '104|110'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted           
        - attribute_type_id: SEPIO:0000440  # has_supporting_evidence   
          value: 0.99956816
          value_type_id: EDAM:data_1772     # score 
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: BFO:0000050  # part_of
          value: IAO:0000315 # abstract
          value_type_id: IAO:0000314 # document part
          description: The part of the document where the sentence is located
          attribute_source:  infores:text-mining-provider-targeted

- attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
  value: PMID:12345678
  value_type_id: biolink:Publication
  value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
  description: A document that has part at least one sentence that asserts the Biolink association represented by this edge
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for a given article as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2021
      value_type_id: UO:0000036  # year
      description: The year this document was published
      attribute_source: infores:pubmed 
    - attribute_type_id: SIO:000028  # has part
      value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
      value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'  
      description: A sentence asserting the Biolink association represented by the parent edge     
      attribute_source: infores:pubmed
      attributes:
        - attribute_type_id: SIO:000028  # has part
          value: '42|53'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge.
          attribute_source:  infores:text-mining-provider-targeted
        - attribute_type_id: SIO:000028  # has part
          value: '75|81'
          value_type_id: SIO:001056 # character position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge.
          attribute_source: infores:text-mining-provider-targeted           
        - attribute_type_id: SEPIO:0000440  # has_supporting_evidence   
          value: 0.876
          value_type_id: EDAM:data_1772     # score 
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted
        - attribute_type_id: BFO:0000050  # part_of
          value: IAO:0000305 # document title
          value_type_id: IAO:0000314 # document part
          description: The part of the document where the sentence is located
          attribute_source:  infores:text-mining-provider-targeted
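
The blob above is shown as YAML for readability. A minimal sketch of how it might be packed into the `_attributes` column as a JSON string and read back, using only the standard library, is given below. It is not the KGX TSV sink; only two of the attributes are included, and the column list follows the proposed edge table.

```python
import csv
import io
import json

attributes = [
    {
        "attribute_type_id": "biolink:original_knowledge_source",
        "value": "infores:text-mining-provider-targeted",
        "value_type_id": "biolink:InformationResource",
        "attribute_source": "infores:text-mining-provider-targeted",
    },
    {
        "attribute_type_id": "SEPIO:0000168",  # confidence_score
        "value": 0.9378,
        "value_type_id": "biolink:ConfidenceLevel",
        "attribute_source": "infores:text-mining-provider-targeted",
    },
]

columns = [
    "subject", "predicate", "object", "association_type",
    "sentence_count", "confidence_score", "publications", "_attributes",
]
row = {
    "subject": "CHEBI:3215",
    "predicate": "biolink:entity_negatively_regulates_entity",
    "object": "PR:000031567",
    "association_type": "biolink:ChemicalToGeneAssociation",
    "sentence_count": 2,
    "confidence_score": 0.9378,
    "publications": "PMID:29085514,PMID:12345678",
    "_attributes": json.dumps(attributes),  # the whole blob as one JSON string
}

# Write the row; the csv module quotes the JSON cell because it contains '"'.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(row)

# Read it back and decode the blob.
buffer.seek(0)
parsed = next(csv.DictReader(buffer, delimiter="\t"))
restored = json.loads(parsed["_attributes"])
print(restored[1]["value"])
```

Note that the JSON cell only survives if both writer and reader apply the same CSV quoting rules; a consumer that splits lines on tabs without handling quotes would see the doubled quote characters.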

@cmungall, @RichardBruskiewich, @mikebada, @mbrush - if you have a moment to review at some point I would appreciate any feedback. I think there was talk of limiting the depth of the attribute nesting to two. The representation above uses a depth of three, so perhaps this proposal is a non-starter. Thanks in advance for any comments/suggestions! - Bill
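
The nesting depth in question can be checked mechanically. The sketch below counts a top-level attribute as depth 1, a convention chosen here so that the publication, sentence, and span levels above come out as three; `nesting_depth` is a hypothetical helper and the attribute dicts are abbreviated.

```python
def nesting_depth(attribute):
    """Depth of an attribute tree: 1 for a flat attribute, +1 per nested 'attributes' level."""
    children = attribute.get("attributes", [])
    return 1 + max((nesting_depth(child) for child in children), default=0)

# Abbreviated shape of one publication attribute from the blob above.
publication_attribute = {
    "attribute_type_id": "SEPIO:0000438",
    "value": "PMID:29085514",
    "attributes": [
        {"attribute_type_id": "biolink:has_year_published", "value": 2017},
        {
            "attribute_type_id": "SIO:000028",  # the sentence
            "attributes": [
                {"attribute_type_id": "SIO:000028", "value": "31|42"},    # subject span
                {"attribute_type_id": "SIO:000028", "value": "104|110"},  # object span
            ],
        },
    ],
}

print(nesting_depth(publication_attribute))
```

Under this convention a limit of two would allow an attribute to carry sub-attributes, but would reject sub-attributes that carry attributes of their own.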

bill-baumgartner commented 3 years ago

Here is the next iteration of our use case. This version complies with the TRAPI attribute constraint that limits the nesting of attributes to a single level (see this PR), i.e. an attribute can have attributes, but its attributes cannot have attributes.

Use case

"The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells." PMID:29085514
"This is a second sentence indicating that bupivacaine negatively regulates LRRC3B." PMID:12345678

Proposed Node TSV

id	name	category
CHEBI:3215	bupivacaine	biolink:ChemicalEntity
PR:000031567	leucine-rich repeat-containing protein 3B	biolink:Protein

Proposed Edge TSV (Note: scroll table to see all columns)

subject	predicate	object	id	association_type	sentence_count	confidence_score	publications	_attributes
CHEBI:3215	biolink:entity_negatively_regulates_entity	PR:000031567	hcR2-6QIJratLDFyFxwcSO6UW1M	biolink:ChemicalToGeneAssociation	2	0.9378	PMID:29085514,PMID:12345678	`ATTRIBUTE_JSON_BLOB`

where the ATTRIBUTE_JSON_BLOB would be the following:

- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:has_evidence_count ## NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
  value: 2
  value_type_id: biolink:EvidenceCount ## NOTE: THIS CLASS DOES NOT EXIST IN BIOLINK
  description: The count of the number of sentences that assert this edge
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: SEPIO:0000168  # confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
  value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
  value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'  
  description: A sentence asserting the Biolink association represented by the parent edge     
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: BFO:0000050  # part_of
      value: PMID:29085514
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2017
      value_type_id: UO:0000036  # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: BFO:0000050 # part_of
      value: IAO:0000315 # abstract
      value_type_id: IAO:0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: SEPIO:0000440  # has_supporting_evidence   
      value: 0.99956816
      value_type_id: EDAM:data_1772     # score 
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:has_identifier # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: HCX2k2hTBVNSoReGxxsGcL33jsg
      value_type_id: EDAM:data_2091 # EDAM:accession
      description: A unique identifier for the combination of document/sentence/assertion.
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028  # has part
      value: '31|42'
      value_type_id: biolink:SubjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source:  infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028  # has part
      value: '104|110'
      value_type_id: biolink:ObjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted           

- attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
  value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
  value_type_id: EDAM:data_3671     # text, or SIO:000113 'sentence'  
  description: A sentence asserting the Biolink association represented by the parent edge     
  attribute_source: infores:pubmed
  attributes:
    - attribute_type_id: BFO:0000050  # part_of
      value: PMID:12345678
      value_type_id: biolink:Publication
      value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
      description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_publication_type # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: Journal Article
      value_type_id: MESH:U000020 # publication type
      description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
      attribute_source: infores:pubmed
    - attribute_type_id: biolink:has_year_published # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 2021
      value_type_id: UO:0000036  # year
      description: The year the document in which the sentence appears was published
      attribute_source: infores:pubmed
    - attribute_type_id: BFO:0000050 # part_of
      value: IAO:0000315 # abstract
      value_type_id: IAO:0000314 # document_part
      description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
      attribute_source: infores:pubmed
    - attribute_type_id: SEPIO:0000440  # has_supporting_evidence   
      value: 0.876
      value_type_id: EDAM:data_1772     # score 
      description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: biolink:has_identifier # NOTE: THIS PREDICATE DOES NOT EXIST IN BIOLINK
      value: 6c9D9220faF116beFa1e80800D4
      value_type_id: EDAM:data_2091 # EDAM:accession
      description: A unique identifier for the combination of document/sentence/assertion.
      attribute_source: infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028  # has part
      value: '42|53'
      value_type_id: biolink:SubjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source:  infores:text-mining-provider-targeted
    - attribute_type_id: SIO:000028  # has part
      value: '75|81'
      value_type_id: biolink:ObjectCharacterPosition # SIO:001056 (character position) is not specific enough -- NOT PRESENT IN BIOLINK
      description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
      attribute_source: infores:text-mining-provider-targeted
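
The offset convention used by the character-position attributes above can be sanity-checked with a short script. This is only a minimal Python sketch, not part of KGX or Biolink: the helper names `parse_spans` and `covered_text` are hypothetical, and it assumes that offsets count Unicode code points and that the end offset is exclusive. That reading is consistent with the example values but is not stated explicitly in the thread.

```python
from typing import List, Tuple


def parse_spans(value: str) -> List[Tuple[int, int]]:
    """Parse 'start|end' offsets; commas separate discontinuous spans.

    Hypothetical helper: assumes the pipe/comma convention described in the
    attribute descriptions above.
    """
    spans = []
    for part in value.split(","):
        start, end = part.strip().split("|")
        spans.append((int(start), int(end)))
    return spans


def covered_text(sentence: str, value: str) -> str:
    """Return the text covered by the span(s), joined by single spaces.

    Assumption: the end offset is exclusive, as in Python slicing.
    """
    return " ".join(sentence[start:end] for start, end in parse_spans(value))


sentence = (
    "The administration of 50 µg/ml bupivacaine promoted maximum breast "
    "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
)

subject_text = covered_text(sentence, "31|42")
object_text = covered_text(sentence, "104|110")
print(subject_text, "/", object_text)
```

Checking each span against its sentence this way is a cheap guard against off-by-one errors before the offsets are written into an edge attribute.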

bill-baumgartner commented 3 years ago

@nlharris, if there is time, would it be possible to discuss this issue at the next Biolink Help Desk?

nlharris commented 3 years ago

Up to @sierra-moxon but should be possible! I'll tag it.

nlharris commented 3 years ago

Oh wait, this is in a different repo that doesn't have all the labels we have in biolink-model. I put it on the help desk agenda.

bill-baumgartner commented 3 years ago

Thanks @nlharris! @sierra-moxon I should have tagged you here originally. Sorry about that. Let me know if you think a different venue would be more appropriate to discuss this instead of the Help Desk. Thanks!

sierra-moxon commented 3 years ago

Help desk is great! Thank you for doing so much work on this, and for pointing out this ticket! We'll plan on it for next Monday. :D

mbrush commented 3 years ago

Hi Bill, thanks so much for this. It's a great start. I have some specific feedback and suggestions that would best be discussed on a call, but I am OOO the first half of next week. Any chance we can push it to the following week's Help Desk, or discuss on another call next week? Anytime 8/5 or after.

bill-baumgartner commented 3 years ago

Hi Matt. Very interested in your feedback so happy to push it back to the following week's Help Desk assuming that works for @sierra-moxon. Thanks!

sierra-moxon commented 3 years ago

yep that works :) sounds good!

bill-baumgartner commented 3 years ago

Provisional model of Text Mining Provider EPC metadata in the KGX format

This post summarizes discussions that have occurred over the past two weeks regarding the structure of EPC metadata for results returned by the Text Mining Provider. The use case is repeated from above for completeness. Note that this post is largely a recapitulation of both the example and figure (below) composed by @mbrush.

[Figure: screenshot of the EPC metadata model composed by @mbrush (Screen Shot 2021-08-27 at 12 52 25 AM)]

Use case

"The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells." PMID:29085514
"This is a second sentence indicating that bupivacaine negatively regulates LRRC3B." PMID:12345678

Proposed Node TSV

id	name	category
CHEBI:3215	bupivacaine	biolink:ChemicalEntity
PR:000031567	leucine-rich repeat-containing protein 3B	biolink:Protein

Proposed Edge TSV (Note: scroll table to see all columns)

subject	predicate	object	id	association_type	confidence_score	supporting_study_results	supporting_publications	_attributes
CHEBI:3215	biolink:entity_negatively_regulates_entity	PR:000031567	hcR2-6QIJratLDFyFxwcSO6UW1M	biolink:ChemicalToGeneAssociation	0.9378	tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg|tmkp:6c9D9220faF116beFa1e80800D4	PMID:29085514|PMID:12345678	ATTRIBUTE_JSON_BLOB

where the ATTRIBUTE_JSON_BLOB would be JSON represented by the following YAML:

- attribute_type_id: biolink:original_knowledge_source
  value: infores:text-mining-provider-targeted
  value_type_id: biolink:InformationResource
  description: The Text Mining Provider Targeted Biolink Association KP from NCATS Translator provides text-mined assertions from the biomedical literature.
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:supporting_data_source
  value: infores:pubmed
  value_type_id: biolink:InformationResource
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:supporting_document    ## NOT CURRENTLY IN BIOLINK
  value: PMID:29085514|PMID:12345678
  value_type_id: biolink:Publication
  description: The documents that contain the sentences that assert the Biolink association represented by the parent edge
  attribute_source: infores:pubmed

- attribute_type_id: biolink:tmkp_confidence_score
  value: 0.9378
  value_type_id: biolink:ConfidenceLevel
  description: An aggregate confidence score that combines evidence from all sentences that support the edge
  attribute_source: infores:text-mining-provider-targeted

- attribute_type_id: biolink:supporting_study_result    ## NOT CURRENTLY IN BIOLINK
  value: tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg
  value_type_id: biolink:TextMiningResult    ## NOT CURRENTLY IN BIOLINK
  description: A single result from running an NLP tool over a piece of text
  attribute_source: infores:text-mining-provider-targeted
  attributes:

        - attribute_type_id: biolink:supporting_text    ## NOT CURRENTLY IN BIOLINK
          value: The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.
          value_type_id: EDAM:data_3671   # EDAM:text
          description: The text that asserts the relationship between the subject and object entity
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document    ## NOT CURRENTLY IN BIOLINK
          value: PMID:29085514
          value_type_id: biolink:Publication
          value_url: https://pubmed.ncbi.nlm.nih.gov/29085514/
          description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document_type    ## NOT CURRENTLY IN BIOLINK
          value: Journal Article
          value_type_id: MESH:U000020 # publication type
          description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document_year    ## NOT CURRENTLY IN BIOLINK
          value: 2017
          value_type_id: UO:0000036  # year
          description: The year the document in which the sentence appears was published
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_text_located_in    ## NOT CURRENTLY IN BIOLINK
          value: IAO:0000315 # abstract
          value_type_id: IAO:0000314 # document_part
          description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:extraction_confidence_score    ## NOT CURRENTLY IN BIOLINK  
          value: 0.9995681
          value_type_id: EDAM:data_1772     # EDAM:score 
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted

        - attribute_type_id: biolink:subject_location_in_text    ## NOT CURRENTLY IN BIOLINK
          value: '31|42'
          value_type_id: SIO:001056 # SIO:character_position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
          attribute_source:  infores:text-mining-provider-targeted

        - attribute_type_id: biolink:object_location_in_text    ## NOT CURRENTLY IN BIOLINK
          value: '104|110'
          value_type_id: SIO:001056 # SIO:character_position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
          attribute_source: infores:text-mining-provider-targeted 

- attribute_type_id: biolink:supporting_study_result    ## NOT CURRENTLY IN BIOLINK
  value: tmkp:6c9D9220faF116beFa1e80800D4
  value_type_id: biolink:TextMiningResult    ## NOT CURRENTLY IN BIOLINK
  description: A single result from running an NLP tool over a piece of text
  attribute_source: infores:text-mining-provider-targeted
  attributes:

        - attribute_type_id: biolink:supporting_text    ## NOT CURRENTLY IN BIOLINK
          value: This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.
          value_type_id: EDAM:data_3671   # EDAM:text
          description: The text that asserts the relationship between the subject and object entity
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document    ## NOT CURRENTLY IN BIOLINK
          value: PMID:12345678
          value_type_id: biolink:Publication
          value_url: https://pubmed.ncbi.nlm.nih.gov/12345678/
          description: The document that contains the sentence that asserts the Biolink association represented by the parent edge
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document_type    ## NOT CURRENTLY IN BIOLINK
          value: Journal Article
          value_type_id: MESH:U000020 # publication type
          description: The publication type(s) for the document in which the sentence appears, as defined by PubMed; pipe-delimited
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_document_year    ## NOT CURRENTLY IN BIOLINK
          value: 2017
          value_type_id: UO:0000036  # year
          description: The year the document in which the sentence appears was published
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:supporting_text_located_in    ## NOT CURRENTLY IN BIOLINK
          value: IAO:0000315 # abstract
          value_type_id: IAO:0000314 # document_part
          description: The part of the document where the sentence is located, e.g. title, abstract, introduction, conclusion, etc.
          attribute_source: infores:pubmed

        - attribute_type_id: biolink:extraction_confidence_score    ## NOT CURRENTLY IN BIOLINK  
          value: 0.876
          value_type_id: EDAM:data_1772     # EDAM:score 
          description: The score provided by the underlying algorithm that asserted this sentence to represent the assertion specified by the parent edge
          attribute_source: infores:text-mining-provider-targeted

        - attribute_type_id: biolink:subject_location_in_text    ## NOT CURRENTLY IN BIOLINK
          value: '42|53'
          value_type_id: SIO:001056 # SIO:character_position
          description: The start and end character offsets relative to the sentence for the subject of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
          attribute_source:  infores:text-mining-provider-targeted

        - attribute_type_id: biolink:object_location_in_text    ## NOT CURRENTLY IN BIOLINK
          value: '75|81'
          value_type_id: SIO:001056 # SIO:character_position
          description: The start and end character offsets relative to the sentence for the object of the assertion represented by the parent edge; start and end offsets are pipe-delimited, discontinuous spans are delimited using commas
          attribute_source: infores:text-mining-provider-targeted
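
To illustrate how the attribute list above could travel in the `_attributes` column of the edge TSV, here is a minimal Python sketch that uses only the standard library. It makes two assumptions: that the blob is serialized as compact JSON, and that the consumer accepts standard CSV quoting for the embedded double quotes. The thread does not specify how KGX parses this column, and only a subset of the attributes is shown for brevity.

```python
import csv
import io
import json

# Subset of the attributes from the YAML above, expressed as Python data.
attributes = [
    {
        "attribute_type_id": "biolink:original_knowledge_source",
        "value": "infores:text-mining-provider-targeted",
        "value_type_id": "biolink:InformationResource",
        "attribute_source": "infores:text-mining-provider-targeted",
    },
    {
        "attribute_type_id": "biolink:supporting_study_result",
        "value": "tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg",
        "value_type_id": "biolink:TextMiningResult",
        "attribute_source": "infores:text-mining-provider-targeted",
        "attributes": [
            {
                "attribute_type_id": "biolink:supporting_document",
                "value": "PMID:29085514",
                "value_type_id": "biolink:Publication",
                "attribute_source": "infores:pubmed",
            },
            {
                "attribute_type_id": "biolink:subject_location_in_text",
                "value": "31|42",
                "value_type_id": "SIO:001056",
                "attribute_source": "infores:text-mining-provider-targeted",
            },
        ],
    },
]

columns = [
    "subject", "predicate", "object", "id", "association_type",
    "confidence_score", "supporting_study_results",
    "supporting_publications", "_attributes",
]

row = {
    "subject": "CHEBI:3215",
    "predicate": "biolink:entity_negatively_regulates_entity",
    "object": "PR:000031567",
    "id": "hcR2-6QIJratLDFyFxwcSO6UW1M",
    "association_type": "biolink:ChemicalToGeneAssociation",
    "confidence_score": 0.9378,
    "supporting_study_results": "tmkp:HCX2k2hTBVNSoReGxxsGcL33jsg|tmkp:6c9D9220faF116beFa1e80800D4",
    "supporting_publications": "PMID:29085514|PMID:12345678",
    # Assumption: compact JSON on a single line, so the row stays on one TSV line.
    "_attributes": json.dumps(attributes, separators=(",", ":")),
}

# Write the header and the single edge row as tab-separated text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(row)
tsv_text = buffer.getvalue()

# Round trip: read the row back and decode the JSON blob.
parsed = next(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
restored = json.loads(parsed["_attributes"])
```

Because the JSON contains double quotes, the `csv` writer quotes the `_attributes` field and doubles the embedded quotes. A reader that splits naively on tabs without handling CSV quoting would not recover the blob intact, so the quoting behavior of the consuming tool is worth confirming.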

RichardBruskiewich commented 3 years ago

Thanks, @bill-baumgartner, for the great documentation of the attribute semantics for text mining EPC.

As indicated in one of the previous team calls, I would simply add an optional biolink:predicate_text_location attribute (hence a Biolink attribute_type_id value). I understand that your text mining does not bother with predicate tagging in sentences; however, I believe that SemMedDb provides such predicate phrase mapping, so it would be a useful extra (albeit optional) field for the model, which resources like SemMedDb could fill.

nlharris commented 2 years ago

What's the status of this?

bill-baumgartner commented 2 years ago

Hi @nlharris, thanks for the prompt. I'll go ahead and close this issue as it is resolved.
