NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Consider moving beyond PubMed/PMC as "ground truth" within Translator #160

Closed karafecho closed 6 months ago

karafecho commented 1 year ago

This issue is to suggest that we (the Translator Consortium) consider including sources other than PubMed/PMC as "ground truth" for all publication support and NLM-derived assertions.

For context, the issue was first suggested on a Data Modeling call, with a use case that Vlado put forward related to modeling supporting publications. Specifically, MolePro identified a number of unstructured publications that didn't fit with the Data Modeling group's proposal for modeling supporting publications.

`The lipid handbook with CD-ROM
Thematic Review Series: Glycerolipids. Phosphatidylcholine and choline homeostasis
Thematic Review Series: Glycerolipids. Phosphatidylserine and phosphatidylethanolamine in mammalian cells: two metabolically related aminophospholipids
Tinto WF, Reynolds WF, Seaforth CE, Mohammed S, Maxwell A. New bitter saponins from the bark of Colubrina elliptica: 1H and 13C assignments by 2D NMR spectroscopy. Magnetic resonance in chemistry 1993;31(9):859-864. [Structure]
Toranosuke Saito, 'Nuclear substituted salicylic acids and their salts.' U.S. Patent US5049685, issued November, 1979.: http://www.google.ca/patents/US5049685
Toranosuke Saito, Takashi Ishibashi, Tomoharu Shiozaki, Tetsuo Shiraishi, 'Developer for pressure-sensitive recording sheets, aqueous dispersion of the developer and method for preparing the developer.' U.S. Patent US5118443, issued September, 1986.: http://www.google.ca/patents/US5118443
Toshihito Kumagai, Takeshi Kuwada, Tsuyoshi Shibata, Masato Hayashi, Yuri Fujisawa, Yoshinori Sekiguchi, '1,3-Dihydro-2h-indol-2-one derivative.' U.S. Patent US20060276449, issued December 07, 2006.: http://www.google.ca/patents/US20060276449
Toshihito Kumagai, Takeshi Kuwada, Tsuyoshi Shibata, Masato Hayashi, Yuri Fujisawa, Yoshinori Sekiguchi, '1,3-dihydro-2H-indol-2-one derivative.' U.S. Patent US07528124, issued May 05, 2009.: http://www.google.ca/patents/US07528124
Toshimitsu Sakuma, Kozo Sato, 'NOVEL DEHYDROABIETIC ACID POLYMER.' U.S. Patent US20120101250, issued April 26, 2012.: http://www.google.ca/patents/US20120101250`
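
As a rough illustration of why these entries are hard to model, here is a minimal Python sketch that separates entries carrying a recoverable patent number or link from entries that are free text only. The regular expressions and the `classify` helper are hypothetical and written only against the strings listed above; they are not part of MolePro or any Translator tooling.

```python
import re

# Assumption: links in these entries are plain http(s) URLs at the end of the line.
URL_RE = re.compile(r"https?://\S+")
# Assumption: patent numbers are written as "U.S. Patent US<digits>", as in the list above.
PATENT_RE = re.compile(r"U\.S\. Patent (US\d+)")


def classify(entry: str) -> dict:
    """Label one unstructured entry as a patent (with number) or as free text."""
    url = URL_RE.search(entry)
    patent = PATENT_RE.search(entry)
    uri = url.group(0) if url else None
    if patent:
        return {"type": "patent", "id": patent.group(1), "uri": uri}
    # Nothing machine-readable was found beyond an optional link.
    return {"type": "text", "id": None, "uri": uri}


patent_entry = (
    "Toranosuke Saito, 'Nuclear substituted salicylic acids and their salts.' "
    "U.S. Patent US5049685, issued November, 1979.: "
    "http://www.google.ca/patents/US5049685"
)
book_entry = "The lipid handbook with CD-ROM"

print(classify(patent_entry))
print(classify(book_entry))
```

Under these assumptions, the patent entries yield a number and a link, while the handbook, the review-series titles, and the Tinto journal citation fall through as text with no identifier.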

Gustavo commented in Slack that the Tinto reference wasn't picked up because it was published in a journal that is not indexed in PubMed.

That post prompted a reply from me and a thread involving me, Gustavo, Anne, and Sandrine.

Here's the main issue:

We (the Translator Consortium) rely heavily on PubMed as a "ground-truth" source for all publication support and NLM-derived assertions. However, PubMed is only one of many indexing services. Moreover, it is specialized for the biomedical and life sciences; it does not index all biomedical journals; and it omits many publications from other fields relevant to Translator, e.g., chemistry and the social sciences. I'm not sure whether this is worth a broader discussion, but I wanted to point it out, as Vlado's use-case example perhaps suggests a need to broaden our reach a bit.

This issue is clearly not a priority for the September release, but should be considered as a future priority.

LEHunter commented 1 year ago

+1

We should also plan to cover patents, which are a vitally important source of relevant information.

Larry

On Apr 14, 2023, at 5:55 PM, karafecho wrote:


sierra-moxon commented 1 year ago

From DM call: We decided to support "text-based" sources when CURIEs or URIs cannot be established. We encouraged KPs to reach out to sources that routinely cite PubMed articles without identifiers and ask them to update their data distribution to include identifiers in a parsable format. For pubs not in PubMed, we could link to the journal site itself, using URIs for this attribution.
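
A minimal Python sketch of that fallback order, for illustration only. The `Citation` record, its field names, and the `publication_reference` helper are hypothetical, and the `PMID:` and `DOI:` prefixes are assumed here rather than taken from the DM call notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """Hypothetical citation record; field names are illustrative only."""
    text: str
    pmid: Optional[str] = None
    doi: Optional[str] = None
    url: Optional[str] = None


def publication_reference(c: Citation) -> dict:
    """Return the most specific reference that can be established."""
    if c.pmid:
        return {"kind": "curie", "value": f"PMID:{c.pmid}"}
    if c.doi:
        return {"kind": "curie", "value": f"DOI:{c.doi}"}
    if c.url:
        # Not in PubMed: link to the journal (or patent) site itself.
        return {"kind": "uri", "value": c.url}
    # No CURIE or URI could be established: keep the citation as text.
    return {"kind": "text", "value": c.text.strip()}


patent = Citation(
    text="U.S. Patent US5049685",
    url="http://www.google.ca/patents/US5049685",
)
handbook = Citation(text="The lipid handbook with CD-ROM")

print(publication_reference(patent))
print(publication_reference(handbook))
```

Putting the text-based case last means a source that later adds identifiers to its distribution is upgraded to a CURIE or URI without any change to consumers.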

karafecho commented 1 year ago

Thanks! I think this is a broader discussion, however. It extends beyond the modeling issue to whether we (the Consortium) want to move beyond PubMed/PMC in other areas, NLP being one example.

sierra-moxon commented 1 year ago

@karafecho - should this move to a discussion in the Architecture group? Or is it a relay session? (somewhere else?). Real question is: is this best discussed in TAQA or somewhere else? (happy to have it start in TAQA if that is best) :)

karafecho commented 1 year ago

So, I posted a ticket after a thread discussion with Gustavo, Anne, and Sandrine. We all agreed that we need to capture the thought, but we weren't sure how or where to do so. I posted the ticket to the Feedback repo simply because we don't really have a 'parking lot' repo. Perhaps we can label this as a 'data gap' issue and let it rest in the Feedback repo for revisiting after the September release?

sierra-moxon commented 1 year ago

@karafecho - do you think we should revisit this as a larger group?

karafecho commented 1 year ago

Perhaps as part of TACT's release planning for the next (not the current) release cycle?

sierra-moxon commented 6 months ago

Moved to the TACT parking lot.