Grounding Assist: Capture hints in paper

Description

Q: What is the name of the feature?

A: Grounding Assist

Q: What does this feature enable the user to do?

A: Indirectly, disambiguate a name for a bioentity (e.g. gene) more accurately

Q: What information must the user provide to use the feature?

A: (1) Article information; (2) names of bioentities

Q: What are the applicable constraints, e.g. compatibility or performance?

A: There are three main cases to consider:

- Default: No prior information is available
- Bioentity database identifiers are available
- Species information is available

Q: How does this feature affect each class of user (persona)?

A: Synonyms and orthologues account for a large proportion (30%) of observed errors. Hints could conceivably mitigate other types of error as well (e.g. spelling issues) and enable features such as a true "type-ahead" autocomplete.

Uses
- curation: in normalization
- post-submission: in an automated error flagging system (even if not available at curation time)
- triage: e.g. classifier to more accurately identify potential articles and authors
- information extraction: e.g. context provided by authors
Users
- Biologist: Eventually, better search across deposited data, better discovery
- Editor: Increased quality and trust in the accuracy of Biofactoid data
- Computational biologist: Increased fidelity of Biofactoid data, better data integration
- Curator: Increased fidelity of Biofactoid curation

Specification

Sources of bioentity information

Considerations
- Entity types
- Consistent concepts (gene product, family)
- Compatible identifiers
- Scope
- Accuracy (curated vs NLP)
- Format (file, web service)
- Latency (seconds)
- Hardware (GPU)
Providers
- Curated
  - PubMed
- Natural Language Processing
  - PubTator3
  - Reach

Scoring algorithm

To be determined. The algorithm should consider:

- Location: Prioritization based on mention in title vs abstract vs body
- Type: Local hint (e.g. entity database IDs) vs global (e.g. species)
- Reliability of source
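
As a starting point for discussion, the three considerations above could be combined multiplicatively into a single score per hint. This is only a sketch: the `Hint` fields, the category names and every weight below are hypothetical placeholders, not part of any agreed design, and the actual algorithm remains to be determined.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
SECTION_WEIGHT = {"title": 1.0, "abstract": 0.6, "body": 0.3}   # location
SCOPE_WEIGHT = {"local": 1.0, "global": 0.5}                    # type
SOURCE_WEIGHT = {"curated": 1.0, "nlp": 0.7}                    # reliability


@dataclass(frozen=True)
class Hint:
    """Hypothetical hint record; field names are assumptions."""
    text: str      # mention as it appears in the article
    section: str   # "title", "abstract" or "body"
    scope: str     # "local" (e.g. entity database ID) or "global" (e.g. species)
    source: str    # "curated" or "nlp"


def score_hint(hint: Hint) -> float:
    """Multiply the three weights; unknown categories score 0."""
    return (
        SECTION_WEIGHT.get(hint.section, 0.0)
        * SCOPE_WEIGHT.get(hint.scope, 0.0)
        * SOURCE_WEIGHT.get(hint.source, 0.0)
    )


def rank_hints(hints):
    """Return hints ordered from highest to lowest score."""
    return sorted(hints, key=score_hint, reverse=True)
```

A multiplicative form means a hint that is weak on any one axis is down-weighted overall; an additive or learned combination would be an equally valid alternative to evaluate.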

Tasks

The factoid project should be responsible solely for obtaining bioentity hints for a given article:

1. [x] Define a Hint model
1. [x] PubTator3 Hint provider
1. [ ] Organism Hint ranking
1. [ ] General Hints API
1. [ ] Retrieve and store Hints on create/update of Document
1. [ ] Augment grounding-search query with Hints

At least for network curation, grounding-search should be responsible for scoring search hits in light of hints.
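
To illustrate what scoring search hits in light of hints might look like, the sketch below re-ranks hits by boosting those whose organism matches an organism hint. The hit keys (`id`, `organism`, `score`), the `boost` parameter and the shape of `organism_hints` are all assumptions made for this example; they do not describe the existing grounding-search API.

```python
def rerank_hits(hits, organism_hints, boost=0.5):
    """Re-rank search hits, favouring organisms mentioned in the hints.

    hits: list of dicts with hypothetical keys "id", "organism", "score".
    organism_hints: dict mapping an NCBI Taxonomy ID (str) to a hint
        weight in [0, 1]; organisms without a hint get weight 0.
    boost: hypothetical scaling factor for the hint weight.
    """
    def adjusted(hit):
        weight = organism_hints.get(hit.get("organism"), 0.0)
        return hit["score"] * (1.0 + boost * weight)

    return sorted(hits, key=adjusted, reverse=True)


# Example: a search for "p53" where the article hints at human (taxon 9606).
hits = [
    {"id": "22059", "organism": "10090", "score": 1.00},  # mouse Trp53
    {"id": "7157", "organism": "9606", "score": 0.95},    # human TP53
]
reranked = rerank_hits(hits, {"9606": 1.0})
```

With no hints the original ordering is preserved, which matches the default case where no prior information is available.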

References

Entity normalization
- Chen, L., Liu, H. & Friedman, C. Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 21, 248–256 (2005)
- Gyori, B. M. et al. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinform Adv 2, (2022)
- Wei, C.-H. et al. GNorm2: an improved gene name recognition and normalization system. Bioinformatics 39, btad599 (2023)
Entity identification
- Luo, L. et al. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 39, (2023)
Species
- Pafilis, E. et al. The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8, e65390 (2013)
- Wei, C.-H. et al. SR4GN: A Species Recognition Software Tool for Gene Normalization. PLoS ONE 7, e38460 (2012)
- Luo, L. et al. Assigning species information to corresponding genes by a sequence labeling framework. Database 2022, baac090 (2022)
Applications
- Wei, C.-H. et al. PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge. arXiv (2024)
