BIDS-projects / topic-modeling

Categorization of various data science institutions into several different topics
Apache License 2.0

Better filtering of features #18

Closed don-han closed 8 years ago

don-han commented 8 years ago


Possible categories of stop words:

To think about: Facebook, Twitter

don-han commented 8 years ago

@BIDS-projects/topic-modeling I think using regex might make our lives easier for no. 2. I did try a third-party library, but it is extremely slow, and it basically does what a regex could easily do.

For no. 1 and no. 3, we would have to start by adding stop words to the scikit-learn model, unless we can implement functions that resemble Evernote Clearly or Readability, which seems to be out of scope for our project.
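A minimal sketch of what adding stop words to a scikit-learn vectorizer could look like. The custom words and sample documents are placeholders for illustration, not our actual list or data, and `CountVectorizer` is an assumption about which vectorizer we end up using:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical project-specific stop words (illustrative only).
custom_stop_words = {"facebook", "twitter", "email", "javascript"}

# Extend scikit-learn's built-in English list with our own terms.
stop_words = list(ENGLISH_STOP_WORDS.union(custom_stop_words))

# Made-up sample documents standing in for scraped page text.
docs = [
    "Follow the data science institute on Facebook and Twitter",
    "The institute offers machine learning workshops",
]

# The vectorizer lowercases tokens and drops anything in stop_words.
vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform(docs)
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

Since the list is just a Python collection, we could keep it in a separate file and grow it as we find more noise terms.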

Readability does have a Parser API, and I sent them an email regarding the request cap, but I really doubt it would handle our case, since we will be requesting thousands of webpages.

don-han commented 8 years ago

Actually, look into "Named Entity Recognition". It might help us immensely with no. 1.

chewisinho commented 8 years ago

21: Boilerplate mostly removed, but some remnants are still there, such as "spambots", "email address", "dot edu", "javascript enabled view"
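One possible way to catch these remnants with a regex, as a rough sketch only. The phrase list is copied from the examples above; `strip_remnants` and the other names are hypothetical helpers, not something already in the repo:

```python
import re

# Leftover boilerplate phrases listed in this comment.
REMNANT_PHRASES = [
    "spambots",
    "email address",
    "dot edu",
    "javascript enabled view",
]


def phrase_to_pattern(phrase):
    # Escape each word and allow any run of whitespace between words.
    return r"\s+".join(re.escape(word) for word in phrase.split())


# One case-insensitive pattern matching any phrase on word boundaries.
REMNANT_RE = re.compile(
    r"\b(?:" + "|".join(phrase_to_pattern(p) for p in REMNANT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def strip_remnants(text):
    # Replace each remnant with a space, then collapse repeated whitespace.
    cleaned = REMNANT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_remnants("Contact us JavaScript enabled   view research"))
```

This only removes the exact phrases, so surrounding fragments of the same boilerplate sentence would still need stop words or another pass.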