Background: Replacing PubTator with EXTRACT/Text2Term for infectiousAgent, species, and healthCondition augmentation may introduce more false positives. Our infrastructure already supports dictionary-based normalization of metadata for these fields; however, a term extraction that is correct for one abstract may be incorrect for another. For this reason, we need to enable record-based corrections for augmented metadata. Additionally, the pipeline may be biased toward certain incorrect mappings, which would be better dropped altogether.
Examples:
The pipeline currently extracts the terms 'Nevada' and 'binary' and maps each to a taxon. 'Nevada' in this context is much more likely to be a state than a species, so we need a way to remove all mentions of 'Nevada'. Similarly, 'binary' is more likely to reference an aspect of a software tool or file format than an actual species.
The term 'perch' may refer to a fish in one record and the PERCH study in another. In this example, we would need to remove the term 'perch' from the species field for one record but not the other.
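The two examples above suggest two kinds of rules: a global drop list (for 'Nevada' and 'binary') and a per-record drop list (for 'perch'). A minimal Python sketch of how they could be combined follows. Everything in it is an assumption made for illustration, not the existing pipeline: the names GLOBAL_DROPS, RECORD_DROPS, and apply_corrections, the record IDs, the `_id` key, and the shape of an augmented field as a list of dicts with a `name` key.

```python
from typing import Dict, Set

# Hypothetical global drop list: terms removed from a field in every record.
GLOBAL_DROPS: Dict[str, Set[str]] = {
    "species": {"nevada", "binary"},
}

# Hypothetical record-level corrections: record id -> field -> terms to remove.
RECORD_DROPS: Dict[str, Dict[str, Set[str]]] = {
    "record_perch_study": {"species": {"perch"}},
}

# The three augmented fields named in this issue.
FIELDS = ("infectiousAgent", "species", "healthCondition")


def apply_corrections(record: dict) -> dict:
    """Return a copy of the record with dropped terms removed.

    Assumes each augmented field is a list of dicts with a 'name' key
    and that the record id is stored under '_id'.
    """
    corrected = dict(record)  # shallow copy; the input record is not modified
    record_rules = RECORD_DROPS.get(record.get("_id", ""), {})
    for field in FIELDS:
        terms = record.get(field)
        if not terms:
            continue
        # A term is removed if it is blocked globally or for this record only.
        blocked = GLOBAL_DROPS.get(field, set()) | record_rules.get(field, set())
        corrected[field] = [
            term for term in terms if term.get("name", "").lower() not in blocked
        ]
    return corrected
```

With these rules, a record about the fish keeps 'perch' in its species field, while the record listed in RECORD_DROPS loses it; 'Nevada' and 'binary' are removed from both. Keeping the two rule sets separate means a record-level correction never affects other records.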
Additional considerations:
issue templates for ease and conformity of correction submission
some sort of automation to reduce the burden of manual checking (in the future, if activity levels become too high to manage)
some sort of process to prevent automatic corrections (in case of malicious or ignorant actors -- unlikely, but a possibility)
some sort of process to indicate that a correction has been applied (label? check boxes in issue template? GH project?)