don-han closed this 8 years ago
@BIDS-projects/topic-modeling I think using regex might make our lives easier for number 2. I did try a third-party library, but it is extremely slow, and it basically does what a regex could easily do.
For no. 1 and no. 3, we would have to start by adding stop words to the scikit-learn model, unless we can implement functions that resemble Evernote Clearly or Readability, which seems to be out of scope for our project.
Readability does have a Parser API, and I sent them an email about the request cap, but I really doubt it would handle our case, since we will be requesting thousands of webpages.
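For the stop-word route mentioned above, here is a rough sketch of what extending scikit-learn's built-in English list could look like. The custom words and the helper name `build_vectorizer` are placeholders I made up for illustration, not something we have agreed on, and `CountVectorizer` is only one of the vectorizers we might end up using.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical site-specific stop words (assumption): the real list would come
# from whatever categories we settle on in this issue.
CUSTOM_STOP_WORDS = {"facebook", "twitter"}


def build_vectorizer(extra_stop_words=CUSTOM_STOP_WORDS):
    """Return a CountVectorizer that ignores English and custom stop words."""
    # Pass a plain list, which is the form the stop_words parameter accepts.
    stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))
    return CountVectorizer(stop_words=stop_words)


# Toy documents standing in for scraped page text.
docs = [
    "Follow the lab on Facebook and Twitter",
    "The lab studies statistical topic models",
]

vectorizer = build_vectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# The learned vocabulary should no longer contain the filtered words.
vocabulary = sorted(vectorizer.vocabulary_)
```

The vectorizer lowercases text by default, so the custom words only need to be listed in lowercase. The same `stop_words` list could be passed to `TfidfVectorizer` if we go that way instead.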
Actually, look into "Named Entity Recognition". It might help us immensely with no.1
Possible categories of stop words:
To think about: Facebook, Twitter