BIDS-projects / scraper

Collects data from websites of data science institutions

Get tiers of websites #19

Closed don-han closed 8 years ago

don-han commented 8 years ago

This might be an expansion or a replacement for #17

Assumption: Links that are closest to the base_url contain the most interesting information.

Reasoning: When web developers plan a website, they usually place what they want to show off on the front page, since that is where most traffic first lands. By that design choice, any links on the front page are the ones the organization's staff deem most important. Likewise, links on the pages reached from the front page are also important, though less so than those on the landing page itself.

Perhaps we can build a weighting algorithm that scores each link by its distance from the base_url.
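
To make the idea of tiers concrete, here is a minimal sketch that assigns each page a tier equal to its minimum click distance from the base_url, using a breadth-first search. It assumes the links have already been scraped into an in-memory dict; the `assign_tiers` function, the `site` dict, and the example.org URLs are all hypothetical and not part of the scraper.

```python
from collections import deque


def assign_tiers(links, base_url):
    """Return {url: tier}, where tier is the fewest clicks needed from base_url.

    `links` is an assumed structure: a dict mapping each page URL to the
    list of URLs it links to.
    """
    tiers = {base_url: 0}          # the landing page is tier 0
    queue = deque([base_url])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in tiers:            # first visit is the shortest path
                tiers[target] = tiers[page] + 1
                queue.append(target)
    return tiers


# Hypothetical site structure, for illustration only.
site = {
    "https://example.org/": [
        "https://example.org/about",
        "https://example.org/research",
    ],
    "https://example.org/about": ["https://example.org/people"],
    "https://example.org/research": [
        "https://example.org/people",
        "https://example.org/",
    ],
}

tiers = assign_tiers(site, "https://example.org/")
```

Breadth-first order matters here: a page reachable by several paths gets the tier of its shortest path, which matches the assumption that closeness to the base_url signals importance.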

don-han commented 8 years ago

@chewisinho suggested that we use Zipf's law, or the more general power law, as a weighting algorithm. I added the weighting issue to the LDA repo (https://github.com/BIDS-projects/lda/issues/10).
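
As a rough illustration of what a Zipf-style weighting could look like, the sketch below gives a page at tier t a weight proportional to 1 / (t + 1)^s. The function names, the `exponent` parameter, and the sample tiers are assumptions for illustration; nothing here is taken from the scraper or the LDA repo.

```python
def zipf_weight(tier, exponent=1.0):
    """Power-law weight for a page `tier` clicks away from the base_url.

    Tier 0 (the landing page) is treated as rank 1, so its weight is 1.0.
    exponent=1.0 gives the classic Zipf form; other values give a general
    power law.
    """
    if tier < 0:
        raise ValueError("tier must be non-negative")
    return 1.0 / (tier + 1) ** exponent


def normalized_weights(tiers, exponent=1.0):
    """Scale the per-page weights so that they sum to 1."""
    raw = {url: zipf_weight(t, exponent) for url, t in tiers.items()}
    total = sum(raw.values())
    return {url: w / total for url, w in raw.items()}


# Hypothetical tiers, for illustration only.
example_tiers = {"/": 0, "/about": 1, "/research": 1, "/people": 2}
weights = normalized_weights(example_tiers)
```

A larger exponent makes the weights fall off faster with distance, so it controls how strongly the front page dominates deeper pages.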