commonsense / conceptnet-numberbatch


Converting sentence into a list of concepts? #35

Closed: eng1neer closed this issue 7 years ago

eng1neer commented 8 years ago

When using the following code from the README:

>>> from conceptnet5.nodes import standardized_concept_uri
>>> standardized_concept_uri('en', 'this is an example')
'/c/en/be_example'

What I get instead is:

{
  "uri": "/c/en/this_be_example"
}

This URI is, of course, not found in conceptnet-ensemble-201603-labels.txt.

Is there a "standard" way to turn an arbitrary sentence into a list of concepts?

rspeer commented 8 years ago

Well, in this particular example, you seem to have found a flaw in the documentation. I might have waffled at the last minute about whether 'this' was a stopword. I should fix that. Fortunately, the standardization is about to get a lot simpler, as I found that stemming isn't necessary in ConceptNet 5.5.

Turning a sentence into a list of concepts is a slightly different problem. You can avoid a lot of complexity by sticking with single-word terms. (Multi-word phrases are valuable, but a process for looking them up can come later.) And you'd probably rather not put too much weight on frequent words like 'is' or 'an', regardless of whether they'd be dropped from a multi-word concept.
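For illustration, the single-word approach could be sketched roughly as follows. This is only a sketch under assumptions: the helper name `sentence_to_single_word_uris` and the toy label set are hypothetical, the regex tokenizer is a stand-in for real tokenization, and it assumes a lowercased single word maps to `/c/en/<word>` without reproducing whatever normalization `standardized_concept_uri` applies.

```python
import re


def sentence_to_single_word_uris(sentence, known_labels, lang="en"):
    """Split a sentence into words and keep those whose URI is a known label.

    Hypothetical helper: builds URIs as /c/<lang>/<word> from lowercased
    words, which is an assumption, not ConceptNet's own normalization.
    """
    # Naive tokenization: runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", sentence.lower())
    uris = [f"/c/{lang}/{word}" for word in words]
    # Drop anything that has no entry in the label set.
    return [uri for uri in uris if uri in known_labels]


# Toy label set standing in for the contents of a labels file.
toy_labels = {"/c/en/this", "/c/en/is", "/c/en/example"}
found = sentence_to_single_word_uris("this is an example", toy_labels)
print(found)  # ['/c/en/this', '/c/en/is', '/c/en/example']
```

Filtering against the label set sidesteps the "not found" problem from the question: terms without an entry are simply skipped rather than looked up.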

I've applied this version of ConceptNet Numberbatch directly to the Story Cloze test, and it worked well compared to many other methods (despite having no representation of events or even word order). What I did (and this is not at all standardized, and quite open to tweaking) was to weight the words by their log inverse frequency. You've already got wordfreq as a dependency of ConceptNet, so you can:
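As an illustration of the weighting idea (a minimal sketch, not necessarily the code originally posted): each word's vector is weighted by the negative log of its frequency and the results are averaged. The names `log_inverse_frequency` and `weighted_average`, the `floor` parameter, and the toy vectors and frequencies are all assumptions made for this example. With wordfreq installed, something like `lambda w: word_frequency(w, 'en')` could be passed as `freq_fn` in place of the toy lookup.

```python
import math


def log_inverse_frequency(freq, floor=1e-9):
    """Return -log(freq), so rarer words get larger weights.

    `floor` (an assumption) avoids log(0) for words with zero frequency.
    """
    return -math.log(max(freq, floor))


def weighted_average(tokens, vectors, freq_fn):
    """Average the vectors of known tokens, weighted by log inverse frequency.

    tokens:  list of single-word terms
    vectors: dict mapping term -> list of floats (toy stand-in for embeddings)
    freq_fn: callable mapping term -> frequency as a proportion in [0, 1]
    """
    total = None
    weight_sum = 0.0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # skip terms that have no vector
        weight = log_inverse_frequency(freq_fn(token))
        if total is None:
            total = [0.0] * len(vec)
        for i, value in enumerate(vec):
            total[i] += weight * value
        weight_sum += weight
    if total is None or weight_sum == 0.0:
        return None  # nothing usable in this sentence
    return [value / weight_sum for value in total]


# Toy data: made-up 2-d vectors and frequencies, for illustration only.
toy_vectors = {"this": [1.0, 0.0], "is": [1.0, 0.0], "example": [0.0, 1.0]}
toy_freqs = {"this": 0.01, "is": 0.01, "an": 0.005, "example": 0.0001}

sentence_vector = weighted_average(
    ["this", "is", "an", "example"],
    toy_vectors,
    lambda word: toy_freqs.get(word, 0.0),
)
print(sentence_vector)
```

In the toy data, 'example' is a hundred times rarer than 'this' or 'is', so its weight is twice theirs and it contributes as much to the average as the two frequent words combined.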

eng1neer commented 8 years ago

@rspeer

I see, will stick to single words for now.

Thanks for the insight on how to weight the words with wordfreq!