akhilpandey95 / tribble

It is a Node app that scrapes information from different websites and displays it in panels.
MIT License

Suggestion: use the RSS feeds rather than HTML scraping #13

Closed omgmog closed 7 years ago

omgmog commented 8 years ago

Hello,

Just stumbled across your project while looking for repos to help with.

I've noticed that you're scraping the HTML content of sites, and so when they change things you're encountering issues.

Why don't you just use the RSS feed for each site? It will give you a predictable data structure for posts, and it's unlikely to change when they make changes to the front-end of their sites.

Engadget RSS: http://www.engadget.com/rss.xml

Gizmodo (US) RSS: http://feeds.gawker.com/gizmodo/full

This package will allow you to work with the resulting XML easily: https://www.npmjs.com/package/xml2js
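To make the "predictable data structure" point concrete, here is a minimal sketch of turning a parsed feed into plain post records. It assumes the feed has already been parsed by xml2js with its default options, where every child element is wrapped in an array. The names `feedToPosts` and `first`, the record fields, and the sample object are hypothetical and only stand in for real parser output; actual feeds may carry extra or differently shaped elements.

```javascript
// Return the first element of an array-wrapped value, or undefined.
function first(value) {
  return Array.isArray(value) && value.length > 0 ? value[0] : undefined;
}

// Flatten an RSS 2.0 object (in the shape xml2js produces by default)
// into plain post records. `source` is a hypothetical label for the site.
function feedToPosts(parsed, source) {
  const channel = parsed.rss.channel[0];
  const items = channel.item || []; // a feed with no posts has no <item>
  return items.map((item) => {
    const pubDate = first(item.pubDate);
    return {
      source,
      title: first(item.title),
      link: first(item.link),
      published: pubDate ? new Date(pubDate) : null,
    };
  });
}

// Hand-written stand-in for parser output; not taken from a real feed.
const sample = {
  rss: {
    channel: [
      {
        title: ["Example feed"],
        item: [
          {
            title: ["First post"],
            link: ["http://example.com/1"],
            pubDate: ["Mon, 04 Jan 2016 10:00:00 GMT"],
          },
          {
            title: ["Second post"],
            link: ["http://example.com/2"],
            pubDate: ["Mon, 04 Jan 2016 12:30:00 GMT"],
          },
        ],
      },
    ],
  },
};

const posts = feedToPosts(sample, "example");
console.log(posts.map((p) => p.title));
```

Because the same few fields exist for every item, a front-end redesign on the site would not affect this code, only a change to the feed itself would.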

akhilpandey95 commented 8 years ago

Hey @omgmog, thanks for the help, man. At the inception of the project I was fully confident about using the RSS feeds instead of scraping, but the crucial points below made me change my mind:

NOTE:

Please consider the broader aim of the application; let me elaborate. As you may have observed, as soon as the app starts it both scrapes the sites and runs an instance of a web application, so small frames on the front end display the scraped information. For instance, if an author publishes the same content on Engadget, r/technology, and other listed sites concurrently, I would use the timestamp to display that content only once instead of twice. I suspect RSS would give me a tough time with tasks like this.
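To make the de-duplication idea concrete, here is a rough sketch. It assumes each post has already been reduced to a record with hypothetical `source`, `author`, `title`, and `published` fields, however the data was obtained. It also assumes that two posts count as the same when author and title match after trimming and lowercasing, and that the copy with the earliest timestamp is the one to keep. Those rules are illustrative guesses, not how tribble actually does it.

```javascript
// Collapse posts that appear on several sources into a single entry.
// Assumption: same author + same title (case-insensitive) means "same post";
// the earliest published copy wins.
function dedupePosts(posts) {
  const seen = new Map();
  for (const post of posts) {
    const key =
      post.author.trim().toLowerCase() + "|" + post.title.trim().toLowerCase();
    const existing = seen.get(key);
    if (!existing || post.published < existing.published) {
      seen.set(key, post);
    }
  }
  // Newest first, ready to be shown in a panel.
  return [...seen.values()].sort((a, b) => b.published - a.published);
}

// Made-up records for illustration only.
const collected = [
  {
    source: "engadget",
    author: "Jane Doe",
    title: "New phone announced",
    published: new Date("2016-01-04T10:00:00Z"),
  },
  {
    source: "r/technology",
    author: "jane doe",
    title: "New phone announced ",
    published: new Date("2016-01-04T10:05:00Z"),
  },
  {
    source: "gizmodo",
    author: "John Roe",
    title: "Chip shortage",
    published: new Date("2016-01-04T11:00:00Z"),
  },
];

const unique = dedupePosts(collected);
console.log(unique.map((p) => p.source + ": " + p.title));
```

A stricter match (for example, also requiring the timestamps to fall within a short window) would reduce false merges at the cost of missing some cross-posts.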

Finally, one more thing to note: I did not choose the websites at random. Each of them is strong in a specific sector, so I wanted tribble to be not a simple scraper but a collective representation of information.