don-han commented 8 years ago

Possible categories of stop words:

[x] 1. geographical locations
[x] 2. times / dates / numbers
[x] 3. boilerplate (menu, skip contents, search form, search)

To think about: Facebook, Twitter

don-han commented 8 years ago

@BIDS-projects/topic-modeling I think using regex might make our lives easier for number 2. I did try a third-party library, but it is extremely slow, and it basically does what a regex could easily do.
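A minimal sketch of the regex idea for number 2, using only Python's standard `re` module. The three patterns and the helper name `strip_numeric_tokens` are assumptions for illustration, not what the project actually uses, and they only cover numeric dates, clock times, and plain numbers.

```python
import re

# Hypothetical patterns (assumptions, not the project's actual rules).
# Dates such as 2016-03-14 or 12/25/2015.
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
# Clock times such as 3:30, 15:45:10, 3:30 pm, 9:00 a.m.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]\.?m\b\.?)?", re.IGNORECASE)
# Plain numbers such as 190, 1,200 or 3.14.
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def strip_numeric_tokens(text):
    """Remove dates, times, and numbers, then collapse leftover whitespace."""
    # Order matters: dates and times go first so the bare-number pattern
    # does not break them into fragments like "-" or ":".
    for pattern in (DATE_RE, TIME_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_numeric_tokens("Seminar on 2016-03-14 at 3:30 pm in room 190")
```

Spelled-out dates (month and weekday names) are not handled here; those would probably be easier to cover with a stop-word list than with more patterns.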

For no. 1 and no. 3, we would have to start by adding stop words to the scikit-learn model, unless we can implement functions that resemble Evernote Clearly or Readability, which seems to be out of scope for our project.
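A rough sketch of what a hand-built stop-word list for no. 1 and no. 3 could look like. The seed words, the variable names, and the helper `remove_stop_words` are all hypothetical; the boilerplate and social entries are taken from the checklist above, and the geographical entries are placeholders.

```python
import re

# Hypothetical seed lists (assumptions for illustration only).
GEO_STOP_WORDS = {"berkeley", "california"}  # placeholder place names
BOILERPLATE_STOP_WORDS = {"menu", "skip", "contents", "search", "form"}
SOCIAL_STOP_WORDS = {"facebook", "twitter"}

# One combined, sorted list so it can be reused wherever a list is expected.
CUSTOM_STOP_WORDS = sorted(
    GEO_STOP_WORDS | BOILERPLATE_STOP_WORDS | SOCIAL_STOP_WORDS
)

TOKEN_RE = re.compile(r"[a-z]+")


def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    stop = set(stop_words)
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in stop]


tokens = remove_stop_words(
    "Menu Skip contents Search form search Berkeley data science on Twitter"
)
```

The same list could be passed to a scikit-learn vectorizer such as `CountVectorizer` through its `stop_words` parameter, which accepts a list of strings. Stop words are matched against single tokens, so multi-word boilerplate phrases would need each of their words listed separately.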

Readability does have a Parser API, and I sent them an email regarding the request cap, but I really doubt it would handle our case, since we will be requesting thousands of webpages.

don-han commented 8 years ago

Actually, look into "Named Entity Recognition". It might help us immensely with no. 1.

chewisinho commented 8 years ago

21 : Boilerplate mostly removed, but some remnants are still left, such as "spambots", "email address", "dot edu", "javascript enabled view"

Better filtering of features #18
