BIDS-projects / topic-modeling

Categorization of various data science institutions into several different topics
Apache License 2.0

Remove boilerplate using jusText or another third-party library #21

Closed don-han closed 8 years ago

don-han commented 8 years ago

Refer to #18

chewisinho commented 8 years ago

Boilerplate is mostly removed, but some remnants are still left, such as "spambots", "email address", "dot edu", and "javascript enabled view".

don-han commented 8 years ago

Hm, do they appear after topic modeling?

chewisinho commented 8 years ago

They appear after running justext and are then fed into topic modeling, so we need to filter them out. (I think justext got rid of most of the boilerplate, such as "skip main content".)
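
One way to filter out the leftovers might look like the sketch below. It is only an illustration, not the code that ended up in the project: the phrase list is copied from the remnants quoted in this thread, the names `RESIDUAL_PHRASES`, `build_pattern`, and `strip_residual_boilerplate` are hypothetical, and it assumes the text has already been lowercased or stopword-stripped by an earlier step, so that the phrases appear as contiguous words.

```python
import re

# Hypothetical phrase list: the remnants reported in this thread.
RESIDUAL_PHRASES = [
    "spambots",
    "email address",
    "dot edu",
    "javascript enabled view",
]


def build_pattern(phrases):
    """Compile one case-insensitive regex that matches any phrase as whole words."""
    # Longest phrases first, so a longer phrase wins over a shorter overlapping one.
    ordered = sorted(phrases, key=len, reverse=True)
    # Allow any run of whitespace between the words of a phrase.
    alternatives = [r"\s+".join(re.escape(w) for w in p.split()) for p in ordered]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def strip_residual_boilerplate(text, pattern):
    """Remove residual phrases, then collapse the whitespace they leave behind."""
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


PATTERN = build_pattern(RESIDUAL_PHRASES)

if __name__ == "__main__":
    sample = ("email address protected spambots need "
              "javascript enabled view research data science")
    print(strip_residual_boilerplate(sample, PATTERN))
```

A fixed phrase list is easy to audit but has to be maintained by hand; if new remnants keep showing up, adding them to the stopword list used by the topic model may be simpler.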

don-han commented 8 years ago

Implemented at BIDS-projects/ETL