eng1neer closed this issue 7 years ago
Well, in this particular example, you seem to have found a flaw in the documentation. I might have waffled at the last minute about whether 'this' was a stopword. I should fix that. Fortunately, the standardization is about to get a lot simpler, as I found that stemming isn't necessary on ConceptNet 5.5.
When it comes to turning a sentence into a list of concepts, that's a slightly different thing. You can avoid a lot of complexity by sticking with single-word terms. (Multi-word phrases are valuable, but a process for looking them up can come later.) And you'd probably rather not put too much weight on the frequent words like 'is' or 'an', regardless of whether they'd be dropped from a multi-word concept.
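As a rough illustration of the single-word approach, here is a minimal sketch. It assumes English text, a naive regex tokenizer, and ConceptNet-style `/c/en/<word>` term URIs. The helper name `sentence_to_single_word_terms` is hypothetical and is not part of ConceptNet or its README.

```python
import re

# Naive English-only tokenizer: runs of letters, optionally with an apostrophe part.
# This is an illustrative assumption, not ConceptNet's own tokenization.
_WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def sentence_to_single_word_terms(sentence, lang="en"):
    """Map each word of a sentence to a single-word term URI such as /c/en/word."""
    words = _WORD_RE.findall(sentence.lower())
    return ["/c/{}/{}".format(lang, word) for word in words]


terms = sentence_to_single_word_terms("This is an example.")
print(terms)
```

Frequent words such as 'is' and 'an' are kept here; they can be down-weighted later instead of being dropped.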
I've applied this version of ConceptNet Numberbatch directly to the Story Cloze test, and it worked well compared to many other methods (despite having no representation of events or even word order). What I did (and this is not at all standardized, and quite prone to tweaking) is to weight the words by their log inverse frequency. You've already got wordfreq as a dependency of ConceptNet, so you can:
- tokenize the sentence with `wordfreq.tokenize`
- weight each word by `-log(wordfreq.word_frequency(word, 'en', 'large', default=1e-9))`
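The weighting described above could be sketched roughly as follows. This is not the exact code used for the Story Cloze experiment. The helper names (`log_inverse_frequency`, `weight_tokens`, `weight_sentence_with_wordfreq`) are hypothetical, and the frequency floor is applied in plain Python rather than through a keyword argument, since, as far as I know, newer wordfreq releases call that argument `minimum` instead of `default`.

```python
import math


def log_inverse_frequency(freq, floor=1e-9):
    """Return -log(freq), clamping unknown or zero frequencies to `floor`."""
    return -math.log(max(freq, floor))


def weight_tokens(tokens, freq_fn, floor=1e-9):
    """Pair each token with its log inverse frequency weight.

    `freq_fn` is any callable mapping a word to a frequency between 0 and 1.
    """
    return [(tok, log_inverse_frequency(freq_fn(tok), floor)) for tok in tokens]


def weight_sentence_with_wordfreq(sentence, lang="en", wordlist="large"):
    """Tokenize and weight a sentence using wordfreq (imported lazily)."""
    import wordfreq  # third-party; already a dependency of ConceptNet

    tokens = wordfreq.tokenize(sentence, lang)
    return weight_tokens(
        tokens, lambda w: wordfreq.word_frequency(w, lang, wordlist)
    )


# Toy frequencies (made up for illustration) so the sketch runs without wordfreq.
toy_freqs = {"the": 0.05, "zebra": 1e-6}
weights = dict(weight_tokens(["the", "zebra"], lambda w: toy_freqs.get(w, 0.0)))
print(weights)
```

Rare words get larger weights than frequent ones, so words like 'is' or 'an' contribute little when the weighted word vectors are combined.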
@rspeer I see, will stick to single words for now.
Thanks for the insight on how to weight the words with wordfreq!
When using the following code from Readme:
What I get instead is
Which is of course not found in conceptnet-ensemble-201603-labels.txt
Is there a "standard" way to turn an arbitrary sentence into a list of concepts?