don-han closed this 8 years ago
Boilerplate is mostly removed, but some remnants are still there, such as "spambots", "email address", "dot edu", and "javascript enabled view".
Hm, do they appear after topic modeling?
They appear after running justext and then get fed into topic modeling, so we need to filter them out. (I think justext got rid of most of the boilerplate, like "skip main content".)
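A minimal sketch of what such a filter could look like, assuming the justext output is already available as a list of paragraph strings. The phrase list is taken from the remnants mentioned above; the names `BOILERPLATE_PHRASES` and `drop_boilerplate` are hypothetical and not taken from the actual ETL code. Note that a phrase like "javascript enabled view" looks like it has already had stopwords removed, so matching against raw text may need a different phrase list than matching against preprocessed tokens.

```python
import re

# Hypothetical phrase list, based on the remnants reported in this thread.
BOILERPLATE_PHRASES = [
    "spambots",
    "email address",
    "dot edu",
    "javascript enabled view",
    "skip main content",
]

# One case-insensitive pattern that matches any of the phrases.
_BOILERPLATE_RE = re.compile(
    "|".join(re.escape(phrase) for phrase in BOILERPLATE_PHRASES),
    re.IGNORECASE,
)


def drop_boilerplate(paragraphs):
    """Return only the paragraphs that contain none of the listed phrases."""
    return [p for p in paragraphs if not _BOILERPLATE_RE.search(p)]


if __name__ == "__main__":
    sample = [
        "This email address is being protected from spambots.",
        "The lab studies topic models.",
        "Skip main content",
    ]
    print(drop_boilerplate(sample))
```

Dropping the whole paragraph is the bluntest option. If a phrase such as "dot edu" can show up inside otherwise useful text, it may be safer to strip just the matching phrase or sentence instead.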
Implemented in BIDS-projects/ETL
Refer to #18