BIDS-projects / scraper

Collects data from websites of data science institutions

prevent scraping webpages for integration services #2

Open don-han opened 8 years ago

don-han commented 8 years ago
alvinwan commented 8 years ago

How would Naive Bayes help us filter webpages? I'm having a hard time seeing how websites could be fit to a conditional probability model.

don-han commented 8 years ago

It's like building a spam filter with Naive Bayes. Since Naive Bayes works well as a text classifier, we can use it to distinguish "integration service" pages like Jenkins from normal webpages. My assumption is that since integration services have their own jargon, such as "idle", "workers", and "builders", our classifier should easily tell integration services apart from normal pages.
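A minimal sketch of the idea, using a hand-rolled multinomial Naive Bayes over word counts. The training examples and labels below are illustrative placeholders, not data from the actual scraper:

```python
import math
from collections import Counter

# Hypothetical labeled examples: "ci" = integration-service page,
# "normal" = ordinary institutional page. Real training data would
# come from scraped page text.
TRAIN = [
    ("idle workers builders executor build queue", "ci"),
    ("build queue idle executor jenkins workers", "ci"),
    ("faculty research data science institute news", "normal"),
    ("about contact publications seminar schedule", "normal"),
]

def train(examples):
    """Fit multinomial Naive Bayes: class priors and per-class word counts."""
    word_counts = {}          # label -> Counter of words
    class_counts = Counter()  # label -> number of documents
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return the label maximizing log P(label) + sum of log P(word | label),
    with add-one (Laplace) smoothing for unseen words."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, `classify("build queue with idle workers", *train(TRAIN))` comes back as `"ci"` because the CI jargon dominates the likelihood, which is exactly the jargon-based separation described above.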

don-han commented 8 years ago

Also, the reason I chose Naive Bayes over other classification methods is that Naive Bayes is super fast, and given that we are processing hundreds of thousands of web pages, we can't afford to run slow algorithms. It definitely is not a perfect algorithm, but I am thinking of adding redundancy to reduce the false positives and false negatives.

alvinwan commented 8 years ago

Mm, all right. Thanks! I don't think I truly understand what Bayes is for, but hopefully taking 188 next semester will help. :P I'll just tag along with you as you code, and I'll try to contribute where I can.

don-han commented 8 years ago

Implemented a temporary measure that blacklists Jenkins: 28170f8b5d4c458fed9892928cd202f385999f18
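A rough sketch of what such an interim blacklist might look like; the keyword list and function name here are illustrative, not taken from the linked commit:

```python
# Hypothetical keyword blacklist for skipping CI-service URLs before
# fetching them. The real commit may match different patterns.
BLACKLIST = ("jenkins",)

def is_blacklisted(url):
    """Return True if the URL looks like a blacklisted integration service."""
    lowered = url.lower()
    return any(term in lowered for term in BLACKLIST)
```

This is cheap (a substring scan per URL), which fits the goal of not slowing down a crawl of hundreds of thousands of pages, but it only catches services named in the list, which is why the classifier above would be the longer-term fix.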