clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Add new entries to stoplist #32

Closed bgyori closed 4 years ago

bgyori commented 4 years ago

This PR extends the stoplist with strings we have observed over the years that are common and are either processing artifacts (e.g., [) or common English words (e.g., net) that get grounded to a named entity through synonyms, leading to incorrect extractions. I also sorted the list alphabetically.

I haven't yet run Reach tests to see if this breaks anything.

bgyori commented 4 years ago

It does actually break some Reach tests that I am now looking into.

MihaiSurdeanu commented 4 years ago

Ok. Let me know when I can release this.

bgyori commented 4 years ago

@MihaiSurdeanu I got to a point where the test failures related to the previous PR are fixed. What I find now is that some of the new stoplist entries are not being excluded from the results. One example is m2 (which often represents square meters in text): I put it in the stoplist, but it is still extracted and grounded because there is a chemical called M2 in both chebi.tsv and PubChem.tsv (those entries are otherwise valid in and of themselves, but the capitalization differs). To resolve this, should I remove these entries from chebi.tsv and PubChem.tsv?

MihaiSurdeanu commented 4 years ago

Yes, I think this is the simplest solution. We could add a pattern for this to stop words, but I think I've seen valid protein names that are this short. Thanks again!
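For illustration, the cleanup discussed here could be sketched roughly as follows. This is a minimal sketch, not the script actually used in this PR: it assumes the KB files are tab-separated with the entity text in the first column, and the helper names and placeholder identifiers are hypothetical.

```python
def find_stoplist_collisions(kb_lines, stoplist):
    """Return KB rows whose first column matches a stoplist entry, ignoring case.

    Assumption: each KB row is tab-separated with the entity text first.
    """
    stop_lower = {s.strip().lower() for s in stoplist if s.strip()}
    collisions = []
    for line in kb_lines:
        row = line.rstrip("\n")
        if not row:
            continue
        text = row.split("\t", 1)[0]
        if text.strip().lower() in stop_lower:
            collisions.append(row)
    return collisions


def drop_collisions(kb_lines, stoplist):
    """Return the KB rows that do not collide with the stoplist."""
    colliding = set(find_stoplist_collisions(kb_lines, stoplist))
    return [line for line in kb_lines if line.rstrip("\n") not in colliding]


# Placeholder rows; the identifiers are made up for the example.
kb = ["M2\tID_PLACEHOLDER_1", "aspirin\tID_PLACEHOLDER_2"]
stoplist = ["m2", "net"]
print(find_stoplist_collisions(kb, stoplist))
print(drop_collisions(kb, stoplist))
```

Listing the collisions before dropping them makes it possible to review each one by hand, since some short names may be valid entities.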

bgyori commented 4 years ago

Alright, m2 is going to be resolved. Another example is k2, which I put in the stoplist but which still gets extracted as a Site, see:

MENTION TEXT:  k2
LABELS:        List(Site)
DISPLAY LABEL: Site
    ------------------------------
    RULE => site_1letter_a
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: k2, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

What can we do about this one?

MihaiSurdeanu commented 4 years ago

Thanks!

I am not sure we need to do anything about "k2". Even if it gets extracted as a site, this spurious extraction will not matter unless it gets attached to an event, which is unlikely, no?

bgyori commented 4 years ago

I had to remove some entries due to the case-insensitivity issue, but now all tests are passing with the chebi branch of Reach.

MihaiSurdeanu commented 4 years ago

Are all these changes in Reach? Can I release the master of bioresources, since this PR was merged?