Closed bgyori closed 4 years ago
It does actually break some Reach tests that I am now looking into.
Ok. Let me know when I can release this.
@MihaiSurdeanu I got to a point where the test failures related to the previous PR are fixed. What I find now is that some of the new entries in the stoplist are not being excluded from the results. One example is `m2` (which often appears in text representing square meters): I put it in the stoplist, but it is still extracted and grounded because there is a chemical called `M2` in both chebi.tsv and PubChem.tsv (those are otherwise valid entries in and of themselves, but the capitalization differs). To resolve this, should I remove these entries from chebi.tsv and PubChem.tsv?
Yes, I think this is the simplest solution. We could add a pattern for this to stop words, but I think I've seen valid protein names that are this short. Thanks again!
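For anyone repeating this cleanup, here is a minimal sketch of how the colliding entries could be located. It is not part of this PR or of Reach; it assumes, purely for illustration, that each KB file is tab-separated with the entity text in the first column, and the function name `find_case_collisions`, the toy rows, and the placeholder IDs are all hypothetical.

```python
import csv
import io


def find_case_collisions(stoplist_lines, kb_rows):
    """Return KB rows whose first column equals a stoplist entry, ignoring case."""
    # Normalize the stoplist once; casefold() is a stricter form of lower().
    stop = {line.strip().casefold() for line in stoplist_lines if line.strip()}
    return [row for row in kb_rows if row and row[0].strip().casefold() in stop]


# Toy data only: the IDs below are placeholders, not real ChEBI identifiers.
stoplist = ["k2", "m2", "net"]
kb_tsv = "M2\tID_A\nglucose\tID_B\n"

# Assumption: tab-separated rows with the entity text in column 0.
rows = list(csv.reader(io.StringIO(kb_tsv), delimiter="\t"))
collisions = find_case_collisions(stoplist, rows)
print(collisions)  # [['M2', 'ID_A']]
```

Running something like this against each KB file would list the candidates for removal; whether a given entry should actually be dropped remains a manual call, since some short names may be valid.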
Alright, `m2` is going to be resolved. Another example is `k2`, which I put in the stoplist but which still gets extracted as a Site, see:
MENTION TEXT: k2
LABELS: List(Site)
DISPLAY LABEL: Site
------------------------------
RULE => site_1letter_a
TYPE => CorefTextBoundMention
------------------------------
GROUNDING: <KBResolution: k2, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>
What can we do about this one?
Thanks!
I am not sure we need to do anything about "k2". Even if it gets extracted as a site, this spurious extraction will not matter unless it gets attached to an event, which is unlikely, no?
I had to remove some entries due to the case-insensitivity issue, but now all tests are passing with the `chebi` branch of Reach.
Are all these changes in Reach? Can I release the master of bioresources, since this PR was merged?
This PR extends the stoplist based on strings we have observed over the years that are common and are either processing artifacts (e.g., `[`) or common English words (e.g., `net`) that, through synonyms, get grounded to a named entity, leading to incorrect extractions. I also sorted the list alphabetically. I haven't yet run Reach tests to see if this breaks anything.
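As a rough illustration of the kind of stoplist maintenance described above, the sketch below merges new entries into an existing list, drops duplicates and blank lines, and sorts the result. The helper name `merge_stoplist` is hypothetical, and the sort is plain code-point order, which may differ from the ordering actually used in the repository's file.

```python
def merge_stoplist(existing, additions):
    """Merge two lists of stoplist entries, de-duplicate, and sort them."""
    # Strip whitespace and skip blank lines before de-duplicating.
    entries = {e.strip() for e in list(existing) + list(additions) if e.strip()}
    # sorted() on strings uses code-point order, so "[" sorts before letters
    # and uppercase letters sort before lowercase ones.
    return sorted(entries)


merged = merge_stoplist(["net", "k2"], ["[", "m2", "k2"])
print(merged)  # ['[', 'k2', 'm2', 'net']
```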