Open herongrove opened 8 years ago
Yea, biologists seem to love these kinds of computer-vexing names. We can add 'time', but the general problem is finding the "common" English words among our 28 KBs, which total 4M+ entries. If you've got a list of (say, a couple hundred) common words, I can check those against our KBs and add any conflicts to the stop list. On the downside, a couple hundred is a small set (unless you're really good at out-guessing the biologists).
For now we can incrementally remove these by adding them to the ner_stoplist file: https://github.com/clulab/bioresources/blob/master/src/main/resources/org/clulab/reach/kb/ner_stoplist.txt
On Tue, Nov 8, 2016 at 6:41 PM, Tom Hicks notifications@github.com wrote:
Yes, and that's what I am doing in this case, of course. I was just suggesting that, if we could identify more potential candidates, I could test them against the KBs for conflicts and add the conflicts to the stop list. Currently the stop list is only 33 words (now 34).
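To make the "test candidates, then add conflicts" step concrete, here is a minimal sketch. It assumes the candidate words and KB keys are already loaded as plain Python collections and that the stop list holds one term per line; both are assumptions, and `find_conflicts` and `merge_into_stoplist` are hypothetical helpers, not part of Reach.

```python
def find_conflicts(candidates, kb_keys):
    """Return candidate words that also occur as KB keys (case-insensitive)."""
    folded_keys = {key.casefold() for key in kb_keys}
    return sorted({w.casefold() for w in candidates if w.casefold() in folded_keys})


def merge_into_stoplist(stoplist_lines, conflicts):
    """Merge new conflicts into the existing entries, sorted and de-duplicated."""
    # Assumption: one term per line, blank lines ignored.
    existing = {line.strip() for line in stoplist_lines if line.strip()}
    return sorted(existing | set(conflicts))


# Toy data only; the real KBs and stop list are far larger.
conflicts = find_conflicts(["time", "table", "protein"], ["TIME", "HeLa", "Table"])
print(conflicts)
print(merge_into_stoplist(["face\n", "time\n"], conflicts))
```

Matching case-insensitively is a design choice here: it catches 'Time' and 'TIME' as well, at the risk of blocking a legitimately capitalized entity name.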
We have a set of w2v embeddings trained on gigaword on the file server. I grabbed those, filtered out the few entries that didn't start with a-z, and then took their intersection with the terms listed in /usr/share/dict/words.
I don't have frequency information on these words and I know the list is a bit noisy (e.g., acronyms, single-letter entries), but I am curious to see how many of these terms overlap with KB entries.
Perhaps we can just look at what case-folded, single-token entries in our KBs exist in this set of words? Hopefully that is not such a large list...
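Roughly, the procedure described in this comment could look like the sketch below. The function names and the in-memory lists are hypothetical stand-ins; in practice the inputs would be read from the embeddings vocabulary, /usr/share/dict/words, and the KB files.

```python
import re


def build_candidates(embedding_vocab, dictionary_words):
    """Keep embedding terms that start with a-z and also appear in the dictionary."""
    starts_with_letter = re.compile(r"[a-z]")
    vocab = {t for t in embedding_vocab if starts_with_letter.match(t)}
    return vocab & set(dictionary_words)


def single_token_overlap(kb_entries, candidates):
    """Case-fold KB entries, keep single-token ones, intersect with the candidates."""
    folded = {e.casefold() for e in kb_entries if len(e.split()) == 1}
    return folded & candidates


# Toy data only.
candidates = build_candidates(
    ["time", "the", "2016", "xyzzy", "-lrb-"],
    ["time", "the", "apple"],
)
overlap = single_token_overlap(
    ["TIME", "The", "tumor necrosis factor", "BRCA1"],
    candidates,
)
print(sorted(overlap))
```

Without frequency information, every overlapping term still needs a manual look before it goes on the stop list.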
Very nice!
On Tue, Nov 8, 2016 at 7:50 PM, Gus Hahn-Powell notifications@github.com wrote:
vocab.txt https://github.com/clulab/reach/files/579853/vocab.txt
Working on getting you that key list now.
I got the key list down to ~186k by dropping the PubChem KB and filtering entries to single "words". The file is on Jenny: /net/kate/storage/work/hickst/temp/wordKEYS
GHP said:
I just checked: it looks like 3K terms overlap. Many of these are valid.
Let's discuss today.
On Wed, Nov 9, 2016 at 8:50 AM, Tom Hicks notifications@github.com wrote:
GHP said:
I just checked: it looks like 3K terms overlap. Many of these are valid.
@hickst, can you please generate another KB dump that doesn't include uniprot and the organs kb?
@myedibleenso the new one, without Uniprot proteins and the Uberon organ KB (and w/o any chemicals) is in: /net/kate/storage/work/hickst/temp/wordKEYS2
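For reference, a key dump of this kind can be sketched as follows. This assumes the KB entries are available as (KB name, key) pairs and treats a "word" as any whitespace-free key; both are assumptions, and `word_keys` is a hypothetical helper rather than the script actually used to produce the files above.

```python
def word_keys(entries, excluded_kbs):
    """Collect case-folded single-word keys from KBs that are not excluded."""
    excluded = {name.casefold() for name in excluded_kbs}
    return sorted({
        key.casefold()
        for kb_name, key in entries
        if kb_name.casefold() not in excluded and len(key.split()) == 1
    })


# Toy data only; KB names here are illustrative labels.
entries = [
    ("cellosaurus", "TIME"),
    ("pubchem", "water"),
    ("uniprot", "Rest"),
    ("cellosaurus", "HeLa S3"),
]
print(word_keys(entries, {"pubchem", "uniprot"}))
```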
'time' is recognized as a Cellosaurus cell line. This and any other cell line names that are more likely to be common English words should be filtered out during NER.
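As an illustration of the requested behavior, the sketch below drops stop-listed mentions after entity recognition. It assumes mentions are simple (text, label) tuples and that matching is case-insensitive; both are assumptions made for the example and do not reflect how Reach represents mentions internally.

```python
def filter_mentions(mentions, stoplist):
    """Drop mentions whose case-folded text is on the stop list."""
    blocked = {term.casefold() for term in stoplist}
    return [m for m in mentions if m[0].casefold() not in blocked]


# Toy data only.
mentions = [("time", "CellLine"), ("HeLa", "CellLine"), ("Time", "CellLine")]
print(filter_mentions(mentions, {"time"}))
```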