commonsense / conceptnet-numberbatch


List of removed stop words #31

Closed: shirish93 closed this issue 8 years ago

shirish93 commented 8 years ago

Section 3.1 of the paper (second paragraph) says that some stop words were removed during pre-processing. Is there a list of the words that were removed? Some very common stop words still appear to be present, so I just wanted to be sure which ones had been deliberately removed.

rspeer commented 8 years ago

The only stopwords removed were "the", "a", and "an" (when they are not the only word), and "to" (when it's the first word).

Many stopwords are needed in particular contexts, so the idea here was to remove a very conservative list of stopwords so that -- for example -- "the Internet" is the same term as "internet", and "to run" is the same term as "run".

See https://github.com/commonsense/conceptnet5/blob/master/conceptnet5/language/english.py for the actual code.
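
For readers who want the rule in executable form, here is a minimal sketch of the behaviour described above. It is not the code from `conceptnet5`; the function name `drop_conservative_stopwords` is hypothetical, and several details are assumptions made for illustration: text is lowercased and split on whitespace, articles are dropped wherever they occur in the term, and a lone "to" is kept so that the result is never empty.

```python
# Illustrative sketch only; see the linked english.py for the real logic.
ARTICLES = {"the", "a", "an"}


def drop_conservative_stopwords(text):
    """Apply the conservative stopword rule described in this thread."""
    # Assumption: lowercase and split on whitespace.
    tokens = text.lower().split()

    # Drop a leading "to", but only when something follows it
    # (assumption: a lone "to" is kept).
    if len(tokens) > 1 and tokens[0] == "to":
        tokens = tokens[1:]

    # Drop "the", "a", and "an" unless nothing else would remain.
    non_articles = [t for t in tokens if t not in ARTICLES]
    if non_articles:
        tokens = non_articles

    return " ".join(tokens)


if __name__ == "__main__":
    for term in ["the Internet", "to run", "the", "go to school"]:
        print(repr(term), "->", repr(drop_conservative_stopwords(term)))
```

Under these assumptions, "the Internet" and "internet" map to the same string, as do "to run" and "run", while "the" on its own and the "to" in "go to school" are left alone.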