Snippets generally often do not support the reported 'Relationship'

mbrush commented 10 months ago

This is a general trend I've noted: the snippet presented for a text-mined edge from TMKP or SemMedDB often does not contain text or assertions supporting the extracted relationship. In many cases, the snippet does not even contain either concept in the relationship.

A couple of examples are shown below, found in support paths for various results for 'What may treat Cerebral Palsy' (PK). This seems to be a very common phenomenon, though, and it is likely to degrade trust in the system.

Why does this happen:

It may be that the concepts/relationship are reported below the cutoff of the displayed snippet. If this is often the cause of the problem, the UI should show more of the snippet.
It may just be that concept recognition by NLP tools is wrong in these cases, and we just have to live with this as a fact of life. But in such cases, I would expect the accuracy score reported by the NLP tool to reflect a lack of confidence, and we could use it to filter or order the snippets that get returned to the user. Or we could find some other creative way to validate extraction accuracy so we don't continually show users snippets that do not support the reported relationship.
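
A minimal sketch of the filter-or-order idea, assuming each snippet arrives with a numeric confidence score from the NLP tool. The `Snippet` type, the `rank_snippets` helper, and the 0.5 threshold are hypothetical names and values chosen for illustration; they are not part of any Translator service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snippet:
    text: str
    confidence: Optional[float]  # None when the NLP tool reported no score


def rank_snippets(snippets: List[Snippet], min_confidence: float = 0.5) -> List[Snippet]:
    """Drop unscored or low-confidence snippets, then order the rest best-first."""
    kept = [
        s for s in snippets
        if s.confidence is not None and s.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda s: s.confidence, reverse=True)
```

Dropping unscored snippets is itself an assumption; a UI could just as reasonably keep them and sort them last.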

bill-baumgartner commented 10 months ago

Even though the column header is labeled "snippet", it's my understanding that the UI does not show the sentence (snippet) from which an assertion was extracted, but instead is showing the beginning of the abstract. This seems to be the case in the examples above: https://pubmed.ncbi.nlm.nih.gov/29558816/ https://pubmed.ncbi.nlm.nih.gov/22357313/

I believe the UI team is working towards displaying the extracted sentences.

andrewsu commented 10 months ago

it's my understanding that the UI does not show the sentence (snippet) from which an assertion was extracted, but instead is showing the beginning of the abstract. I believe the UI team is working towards displaying the extracted sentences.

@Genomewide can you confirm the above? (if there is another issue tracking this change, it would be useful to link that issue here...)

bill-baumgartner commented 2 months ago

From a TMKP perspective, I think we can close this issue. The UI now makes use of the sentence-level EPC data provided in TMKP results to show the sentence from which an assertion was mined (See screenshot below from this query).

bill-baumgartner commented 2 months ago

From a SemMed perspective, I think this is still an open issue. For SemMed results, the snippet displayed by the UI appears to be the beginning of the abstract (see screenshot above).

So, @sierra-moxon, I'll defer to you on whether we close this one and perhaps open another that is specific to SemMed, or remove the TMKP tag and re-add the SemMed tag to this current issue. Thanks!

andrewsu commented 2 months ago

For the same zinc - increases - NFE2L2 edge above, this is what is reported from semmeddb through BTE:

and here is the snippet of the TRAPI response:

                "d8ece4f78faf9e4bc0c270500ab18da8": {
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:24597671",
                                "PMID:25994789",
                                "PMID:16723490",
                                "PMID:23536959",
                                "PMID:23868099",
                                "PMID:33198336"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        },
                        {
                            "attribute_type_id": "biolink:supporting_text",
                            "value": [
                                "Therefore, Zn up-regulates Nrf2 function via activating Akt-mediated inhibition of Fyn function.",
                                "We aim to investigate whether the intracellular free zinc change plays a role in Nrf2 activation.",
                                "The increase of intracellular free zinc may be one mechanism for Nrf2 activation.",
                                "CONCLUSIONS: Induction of the ARE-Nrf2 pathway by zinc provides powerful and prolonged antioxidation and detoxification that may explain the beneficial effects of zinc observed in the treatment of age-related macular degeneration (AMD).",
                                "There was gender difference for the protective effect of zinc against diabetes-induced pathogenic changes and the up-regulated levels of Nrf2 and MT in the aorta.",
                                "The aortic protection by zinc against diabetes-induced pathogenic changes is associated with the up-regulation of both MT and Nrf2 expression.",
                                "This assumption was supported by the observations that knockdown of Nrf2 expression compromised the zinc-induced increase in HO-1 gene transcription, and that zinc increased Nrf2 protein expression and the Nrf2 binding to the AREs.",
                                "In addition, NAC inhibited the Zn-induced Nrf2 activation and limited the concomitant upregulation of cellular GSH concentrations."
                            ]
                        }
                    ],
                    "object": "NCBIGene:4780",
                    "predicate": "biolink:affects",

So the snippets are definitely there. However, I see there are six PMIDs and eight snippets, so I understand that the one-to-many relationship means the UI probably doesn't quite know how to match those up. If the UI team (@Genomewide) wants to tell us how to format those edge attributes (or @bill-baumgartner, if you can comment on how you handle cases like this), we can make the adjustments to our output TRAPI...
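
To make the mismatch concrete, here is a small sketch that checks whether the flat `biolink:publications` and `biolink:supporting_text` lists on an edge could even be paired by position. It assumes the edge is a plain dict shaped like the TRAPI excerpt above; both function names are hypothetical helpers, not part of BTE or the UI.

```python
from typing import Any, Dict, List


def flat_attribute_values(edge: Dict[str, Any], attribute_type_id: str) -> List[Any]:
    """Collect the values of every top-level attribute with the given type id."""
    values: List[Any] = []
    for attr in edge.get("attributes", []):
        if attr.get("attribute_type_id") != attribute_type_id:
            continue
        value = attr.get("value")
        if value is None:
            continue
        # A value may be a single item or a list; normalize to a flat list.
        values.extend(value if isinstance(value, list) else [value])
    return values


def can_pair_positionally(edge: Dict[str, Any]) -> bool:
    """True only when there is exactly one sentence slot per publication."""
    pubs = flat_attribute_values(edge, "biolink:publications")
    texts = flat_attribute_values(edge, "biolink:supporting_text")
    return len(pubs) == len(texts)
```

For the edge above (six PMIDs, eight sentences) this returns `False`. Even when the counts happen to match, positional pairing is only a guess, since nothing in the flat lists ties a sentence to its PMID.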

bill-baumgartner commented 2 months ago

Sure thing @andrewsu - We worked with @mbrush to develop a representation for text-mined EPC. This issue summarizes the outcome of those discussions. Briefly, each sentence that asserts a subject-predicate-object triple is represented as a biolink:TextMiningResult and various attributes are attached in order to represent things like the character positions of the subject and object within the sentence, the text of the sentence, etc. An example of the nested attribute structure is shown below:

{
              "attribute_source": "infores:text-mining-provider-targeted",
              "attribute_type_id": "biolink:has_supporting_study_result",
              "attributes": [
                {
                  "attribute_source": "infores:text-mining-provider-targeted",
                  "attribute_type_id": "biolink:supporting_text",
                  "value": "RT-qPCR analysis demonstrated that supplemental zinc notably enhanced the transcription of SOD, GPX, GR, CAT, and nuclear factor erythroid 2-related factor 2 (Nrf2) (P < 0.05).",
                  "value_type_id": "EDAM:data_3671"
                },
                {
                  "attribute_source": "infores:pubmed",
                  "attribute_type_id": "biolink:publications",
                  "value": "PMID:30343482",
                  "value_type_id": "biolink:Uriorcurie",
                  "value_url": "https://pubmed.ncbi.nlm.nih.gov/30343482/"
                },
                {
                  "attribute_source": "infores:pubmed",
                  "attribute_type_id": "biolink:supporting_text_located_in",
                  "value": "abstract",
                  "value_type_id": "IAO_0000314"
                },
                {
                  "attribute_source": "infores:text-mining-provider-targeted",
                  "attribute_type_id": "biolink:extraction_confidence_score",
                  "value": 0.645050506678129,
                  "value_type_id": "EDAM:data_1772"
                },
                {
                  "attribute_source": "infores:text-mining-provider-targeted",
                  "attribute_type_id": "biolink:subject_location_in_text",
                  "value": "48|52",
                  "value_type_id": "SIO:001056"
                },
                {
                  "attribute_source": "infores:text-mining-provider-targeted",
                  "attribute_type_id": "biolink:object_location_in_text",
                  "value": "159|163",
                  "value_type_id": "SIO:001056"
                },
                {
                  "attribute_source": "infores:pubmed",
                  "attribute_type_id": "biolink:supporting_document_year",
                  "value": 2019,
                  "value_type_id": "UO:0000036"
                }
              ],
              "value": "tmkp:0876c3223ff40674b85d35679a1c1b539a95c764a1544c5ad8a7075a54247351",
              "value_type_id": "biolink:TextMiningResult",
              "value_url": "https://tmui.text-mining-kp.org/evidence/0876c3223ff40674b85d35679a1c1b539a95c764a1544c5ad8a7075a54247351"
            },
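
As a sketch of how a consumer might use this nested structure, the following reads one `biolink:has_supporting_study_result` attribute and marks the subject and object mentions in the sentence. It assumes the offsets are zero-based with an exclusive end ("48|52" covers characters 48 through 51), which is consistent with the example above, and that the two spans do not overlap. The function names and the `**` markers are hypothetical choices for illustration.

```python
from typing import Any, Dict, Tuple


def parse_span(value: str) -> Tuple[int, int]:
    """Parse a "start|end" offset string such as "48|52"."""
    start, end = value.split("|")
    return int(start), int(end)


def highlight_sentence(study_result: Dict[str, Any]) -> str:
    """Wrap the subject and object mentions of one study result in ** markers."""
    # Index nested attributes by type id; assumes each type occurs once per result.
    nested = {a["attribute_type_id"]: a["value"] for a in study_result["attributes"]}
    text = nested["biolink:supporting_text"]
    spans = [
        parse_span(nested["biolink:subject_location_in_text"]),
        parse_span(nested["biolink:object_location_in_text"]),
    ]
    # Insert markers from the rightmost span first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "**" + text[start:end] + "**" + text[end:]
    return text
```

Applied to the example above, the spans select "zinc" and "Nrf2" in the supporting sentence.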

Genomewide commented 1 month ago

Sorry I had not responded to this earlier. I would love to have the snippets for semmeddb! If they are similar in format to TMKP's, that fits with TRAPI even better! Tagging @dnsmith124 and @gprice1129

I thought I was told that these did not exist except for TMKP. So, apologies for not showing them! They are one of the biggest time savers for the text-mined edges!

I added a clickup ticket.

bill-baumgartner commented 1 month ago

Related to this issue and to #803, there are PMIDs that have been processed by Semmed that are no longer in PubMed. This can result in missing publication details as was the topic of #803 (I was going to post to that issue but since it's been closed, I'll post here). It looks as though there are ~120k PMIDs in Semmed that are no longer in PubMed -- some were duplicate records, and some were simply removed as far as I can tell. I've compiled the list (in_semmed_not_in_pubmed.pmids.gz) in case it's helpful for either those KPs that serve Semmed, or perhaps for the UI to filter the results.
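
A KP or the UI could apply such a list roughly as sketched below. This assumes the gzipped file holds one PMID per line, either as a bare number or as a `PMID:` CURIE; the actual layout of in_semmed_not_in_pubmed.pmids.gz is not stated here, and both function names are hypothetical.

```python
import gzip
from typing import Iterable, List, Set


def load_removed_pmids(path: str) -> Set[str]:
    """Read a gzipped text file of PMIDs, assumed to hold one identifier per line."""
    removed: Set[str] = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            pmid = line.strip()
            if pmid:
                # Normalize bare numbers to CURIE form so both styles compare equal.
                removed.add(pmid if pmid.startswith("PMID:") else f"PMID:{pmid}")
    return removed


def drop_removed(publications: Iterable[str], removed: Set[str]) -> List[str]:
    """Keep only the publication CURIEs that are not in the removed set."""
    return [p for p in publications if p not in removed]
```

Holding ~120k short strings in a set is cheap, so the lookup can run per edge without noticeable cost.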

Genomewide commented 1 month ago

How do we get the KPs that provide PMIDs to filter this? We can not do anything about this? @andrewsu Do you filter these out? Are there other teams that need to look at this group to filter that you know of?

andrewsu commented 1 month ago

Created this separate issue to track the issue of missing/deprecated PMIDs

colleenXu commented 1 month ago

@Genomewide @dnsmith124 @gprice1129

Given the entire text-mining edge attribute info, what does the UI actually consume/require? Is it okay if we only have the sentence/snippet and publication ID (biolink:supporting_text, biolink:publications)?

Does the UI only handle 1 sentence per publication? It looks like that's what Text-Mining Targeted is providing (@bill-baumgartner ?).

Does the UI have a limit on how many publications to show? It also looks like Text-Mining Targeted is providing this detailed edge-attribute info on a max of 5 unique publications, even when the biolink:evidence_count > 5 (@bill-baumgartner ?).

Genomewide commented 1 month ago

@gprice1129 @dnsmith124 What comes from TRAPI and what comes from the publication service?

dnsmith124 commented 1 month ago

Here's what a publication looks like when the UI's FE receives it from the BE:

{
    "type": "PMC",
    "url": "https://www.ncbi.nlm.nih.gov/pmc/PMC57751",
    "source": {
        "name": "Text Mining Targeted Association API",
        "url": "https://github.com/NCATSTranslator/Translator-All/wiki/Text%E2%80%90mined-Assertion-KP",
        "knowledge_level": "ml"
    }
}

We send a list of ids to the publication service like so:

https://docmetadata.transltr.io/publications?pubids=PMID:24477236,PMID:9194401,PMID:7597009&request_id=abcd1234

Here's an example of what's returned from the publication service:

        "PMID:10345257": {
            "abstract": "Aspirin use seems to reduce coronary artery disease events in some groups...",
            "article_title": "Does professional advice influence aspirin use to prevent heart disease in an HMO population?",
            "issue": "",
            "journal_abbrev": "Eff Clin Pract",
            "journal_name": "Effective clinical practice : ECP",
            "pub_day": "",
            "pub_month": "Aug",
            "pub_year": "1998",
            "volume": ""
        },

TLDR: We receive the url, the id, and the source (and the snippet and subject/object, not shown here) from TRAPI and the rest comes from the publication service.
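
The split described above can be sketched as follows: build the publication-service request from a list of IDs, then overlay the returned metadata on the fields that came from TRAPI. This is only an illustration, not the UI's actual code; the function names and the merged record shape are assumptions, and no network call is made.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

PUBLICATION_SERVICE = "https://docmetadata.transltr.io/publications"


def build_publication_request(pub_ids: List[str], request_id: str) -> str:
    """Build a request URL with a comma-separated pubids parameter."""
    # safe=":," keeps CURIE colons and list commas unescaped, as in the example URL.
    query = urlencode(
        {"pubids": ",".join(pub_ids), "request_id": request_id}, safe=":,"
    )
    return f"{PUBLICATION_SERVICE}?{query}"


def merge_publication(
    pub_id: str,
    trapi_fields: Dict[str, Any],
    service_results: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Combine TRAPI-derived fields with publication-service metadata for one ID."""
    merged: Dict[str, Any] = {"id": pub_id, **trapi_fields}
    # IDs the service does not know about simply contribute no extra fields.
    merged.update(service_results.get(pub_id, {}))
    return merged
```

With the three IDs from the example request and `request_id` set to `abcd1234`, `build_publication_request` reproduces the URL shown above.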

Genomewide commented 1 month ago

But we need more to show the snippet and the subject and object, right?

dnsmith124 commented 1 month ago

Apologies, that's correct. The snippet and the subject and object come to the FE from the BE response; I believe that is in TRAPI as well. I've edited my comment above to reflect that, and @gprice1129 can provide more info on that portion.

bill-baumgartner commented 1 month ago

@colleenXu

Given the entire text-mining edge attribute info, what does the UI actually consume/require? Is it okay if we only have the sentence/snippet and publication ID (biolink:supporting_text, biolink:publications)?

If you also provide the character offsets relative to the sentence for the subject and object mentions, then the UI can highlight/bold the subject and object in the text. See the biolink:subject_location_in_text attribute for an example.

Does the UI only handle 1 sentence per publication? It looks like that's what Text-Mining Targeted is providing (@bill-baumgartner ?).

It is possible for TMKP results to return more than one sentence per publication. I'm not sure if the UI only displays one sentence per publication or not.

Does the UI have a limit on how many publications to show? It also looks like Text-Mining Targeted is providing this detailed edge-attribute info on a max of 5 unique publications, even when the biolink:evidence_count > 5 (@bill-baumgartner ?).

There is currently a limit of 5 sentences that are returned. @Genomewide has requested that the limit be raised to 50, which we will do in an upcoming release.

gprice1129 commented 1 month ago

@colleenXu for the text mining edge attributes we can consume the following: biolink:publications, biolink:supporting_text, biolink:subject_location_in_text, biolink:object_location_in_text, and in that case we will show the extracted sentence and highlight the subject/object in the UI.

The only thing we actually require is a publication ID, but in that case we will fall back to showing the abstract for the snippet with no highlighting for the subject/object location information (assuming the service we call out to for PMIDs actually has the abstract information).

In the context of a single edge, we currently only support a single sentence for a publication. We have considered supporting multiple sentences for a single publication but decided it was not a priority at the time because it occurred very rarely. If it is happening more often now, we should revisit whether we want to support this feature @Genomewide.

The UI does not limit how many publications it shows.
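
The fallback behavior described above could look roughly like this. It is a sketch under assumptions rather than the UI's real implementation: `choose_snippet`, the flat `evidence` dict keyed by attribute type id, and the returned record shape are all hypothetical.

```python
from typing import Any, Dict, Optional


def choose_snippet(evidence: Dict[str, Any], abstract: Optional[str]) -> Dict[str, Any]:
    """Prefer the extracted sentence; otherwise fall back to the abstract."""
    sentence = evidence.get("biolink:supporting_text")
    if sentence:
        return {
            "kind": "sentence",
            "text": sentence,
            # Offsets are optional; without them the sentence is shown unhighlighted.
            "subject_span": evidence.get("biolink:subject_location_in_text"),
            "object_span": evidence.get("biolink:object_location_in_text"),
        }
    # No sentence-level evidence: show the abstract, if the publication
    # service returned one, with no subject/object highlighting.
    return {
        "kind": "abstract" if abstract else "none",
        "text": abstract or None,
        "subject_span": None,
        "object_span": None,
    }
```

Because only one sentence per publication is supported for a single edge, the sketch takes a single `biolink:supporting_text` value rather than a list.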

gprice1129 commented 1 month ago

The UI team does not think that supporting multiple sentences per publication has much value to a user over a single sentence. If possible we would definitely like to see additional context for Semmed publications if that is available.

colleenXu commented 3 weeks ago

@gprice1129

Does the UI use the top-level biolink:publications edge-attribute (with the list of publication IDs)? Or does it only use the sub-attribute biolink:publications (each individual publication ID)?

NCATSTranslator / Feedback

Snippets generally often do not support the reported 'Relationship' #625