Filter high-frequency English words from KBs

herongrove commented 8 years ago

At the same time , the catalytic activity of KSR1 was confirmed by the fact that tumour necrosis factor alpha ( TNFalpha ) and ceramide can induce KSR1 autophosphorylation and increase its capacity to phosphorylate and activate Raf-1

In the sentence above, "time" is recognized as a Cellosaurus cell line. This and any other cell line names that are more likely to be common English words should be filtered out during NER.

hickst commented 8 years ago

Yea, biologists seem to love these kinds of computer-vexing names. We can add 'time', but the general problem is finding the "common" English words among our 28 KBs, which total 4M+ entries. If you've got a list of (say, a couple hundred) common words, I can check those against our KBs and add any conflicts to the stop list. On the downside, a couple hundred is a small set (unless you're really good at out-guessing the biologists).
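The check described here could be sketched roughly as follows. This is only an illustration: the function name `find_conflicts` and the toy data are hypothetical, and case-insensitive whole-string matching is an assumption rather than a statement of how the KBs are actually matched.

```python
def find_conflicts(common_words, kb_entries, stop_list=()):
    """Return common words that also occur as KB entries and are not yet stopped.

    All three arguments are iterables of strings. Matching is case-insensitive,
    which is an assumption made for this sketch.
    """
    kb_keys = {entry.casefold() for entry in kb_entries}
    already_stopped = {word.casefold() for word in stop_list}
    candidates = {word.casefold() for word in common_words}
    return sorted((candidates & kb_keys) - already_stopped)


# Toy data for illustration only; the real KBs hold millions of entries.
common = ["time", "same", "fact", "hole"]
kb = ["TIME", "KSR1", "Raf-1", "fact"]
conflicts = find_conflicts(common, kb, stop_list=["fact"])
print(conflicts)
```

Anything returned would be a candidate for the stop list, still subject to a manual look, since a common word can also be a legitimate entity name.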

MihaiSurdeanu commented 8 years ago

For now we can incrementally remove these by adding them to the ner_stoplist file: https://github.com/clulab/bioresources/blob/master/src/main/resources/org/clulab/reach/kb/ner_stoplist.txt
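As a rough sketch of what a stop list does at NER time: candidate mentions whose text appears in the list are dropped. The file layout assumed here (one entry per line, `#` for comments), the case-insensitive comparison, and the helper names `load_stop_list` and `filter_mentions` are all assumptions for illustration, not the actual Reach code.

```python
import io


def load_stop_list(lines):
    """Read stop-list entries from an iterable of lines.

    Assumes one entry per line; blank lines and '#' comments are ignored.
    """
    entries = set()
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            entries.add(text.casefold())
    return entries


def filter_mentions(mentions, stop_list):
    """Drop candidate entity mentions whose text is in the stop list."""
    return [m for m in mentions if m.casefold() not in stop_list]


# Stand-in for the real file; in practice this would be an opened ner_stoplist.txt.
fake_file = io.StringIO("# common English words\ntime\nfact\n")
stops = load_stop_list(fake_file)
kept = filter_mentions(["time", "KSR1", "TNFalpha", "Raf-1"], stops)
print(kept)
```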

In reply to Tom Hicks's comment of Tue, Nov 8, 2016 at 6:41 PM.

hickst commented 8 years ago

Yes, and that's what I am doing in this case, of course. I was just suggesting that, if we could identify more potential candidates, I could test them against the KBs for conflicts and add the conflicts to the stop list. Currently the stop list is only 33 words (now 34).

myedibleenso commented 8 years ago

We have a set of w2v embeddings trained on Gigaword on the file server. I grabbed those, filtered out the few entries that didn't start with a-z, and then took their intersection with the terms listed in /usr/share/dict/words.

I don't have frequency information on these words, and I know the list is a bit noisy (e.g., acronyms, single-letter entries), but I am curious to see how many of these terms overlap with KB entries.

Perhaps we can just look at which case-folded, single-token entries in our KBs exist in this set of words? Hopefully that is not such a large list...
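A minimal sketch of those two steps (building the word list, then checking KB entries against it) might look like this. The function names and toy inputs are hypothetical, and "single-token" is taken to mean "contains no whitespace", which is an assumption.

```python
import re

STARTS_WITH_A_TO_Z = re.compile(r"^[a-z]")


def build_vocab(embedding_words, dictionary_words):
    """Keep embedding vocabulary items that start with a-z and are dictionary words."""
    dictionary = {w.strip() for w in dictionary_words}
    return {
        w for w in embedding_words
        if STARTS_WITH_A_TO_Z.match(w) and w in dictionary
    }


def overlapping_kb_entries(kb_entries, vocab):
    """Return case-folded, single-token KB entries that are also in the vocabulary."""
    shared = set()
    for entry in kb_entries:
        tokens = entry.split()
        if len(tokens) == 1 and tokens[0].casefold() in vocab:
            shared.add(tokens[0].casefold())
    return sorted(shared)


# Toy inputs; the real ones would be the embedding vocabulary,
# the lines of /usr/share/dict/words, and the KB entry texts.
embedding_words = ["time", "hole", "</s>", "9th", "zzzq", "a"]
dictionary_words = ["time\n", "hole\n", "a\n", "fact\n"]
vocab = build_vocab(embedding_words, dictionary_words)
kb_entries = ["TIME", "Hole", "tumour necrosis factor", "KSR1"]
overlap = overlapping_kb_entries(kb_entries, vocab)
print(overlap)
```

Without frequency information, every dictionary word counts equally here, so single-letter entries such as "a" survive in the word list and would need separate handling.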

vocab.txt https://github.com/clulab/reach/files/579853/vocab.txt

MihaiSurdeanu commented 8 years ago

Very nice!

In reply to Gus Hahn-Powell's comment of Tue, Nov 8, 2016 at 7:50 PM.

hickst commented 8 years ago

Working on getting you that key list now.

hickst commented 8 years ago

I got the key list down to ~186k by dropping the PubChem KB and filtering entries to single "words". The file is on Jenny: /net/kate/storage/work/hickst/temp/wordKEYS
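For readers without access to that file, a key list of this kind could be produced along the following lines. This is a sketch under stated assumptions: each KB is treated as tab-separated lines with the entry text in the first column, a "word" is an entry containing no whitespace, and the KB names and IDs below are made up.

```python
def single_word_keys(kbs, skip=()):
    """Collect single-word entry texts from all KBs except those named in `skip`.

    `kbs` maps a KB name to an iterable of tab-separated lines whose first
    column is the entry text (an assumed layout for this sketch).
    """
    keys = set()
    for name, lines in kbs.items():
        if name in skip:
            continue
        for line in lines:
            text = line.rstrip("\n").split("\t")[0].strip()
            if text and len(text.split()) == 1:
                keys.add(text)
    return keys


# Toy KBs with hypothetical names and IDs.
kbs = {
    "cell-lines": ["TIME\tid-1", "HeLa\tid-2"],
    "chemicals": ["water\tid-3", "acetic acid\tid-4"],
    "proteins": ["KSR1\tid-5", "tumour necrosis factor\tid-6"],
}
keys = single_word_keys(kbs, skip={"chemicals"})
print(sorted(keys))
```

Passing more names in `skip` would correspond to leaving further KBs out of the dump.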

hickst commented 8 years ago

GHP said:

I just checked: it looks like 3K terms overlap. Many of these are valid.

MihaiSurdeanu commented 8 years ago

Let's discuss today.

In reply to Tom Hicks's comment of Wed, Nov 9, 2016 at 8:50 AM.

myedibleenso commented 8 years ago

@hickst, can you please generate another KB dump that doesn't include UniProt and the organs KB?

hickst commented 8 years ago

@myedibleenso the new one, without UniProt proteins and the Uberon organ KB (and w/o any chemicals), is in: /net/kate/storage/work/hickst/temp/wordKEYS2

clulab / reach

Filter high-frequency English words from KBs #443