datatogether / sentry

Parallelized web crawler written in Golang
GNU Affero General Public License v3.0

Automatic Collection Generation #4

Open b5 opened 7 years ago

b5 commented 7 years ago

A common pattern emerging on government sites is an HTML page with numerous direct links to content URLs, for example: http://www.nrel.gov/gis/data_solar.html

In an ideal world, these pages would automatically generate collections, with metadata attributed to each collection based on the page's HTML content (page title as collection title, meta tags scrutinized & added, etc.).
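As a rough illustration of the metadata idea, here is a minimal sketch that pulls a title and meta tags out of a page body. The `Collection` type and `CollectionFromHTML` function are hypothetical names, not anything that exists in sentry, and the regexp-based matching is an assumption made to keep the example standard-library only. Regular expressions are fragile against real-world HTML, so a proper HTML parser (such as `golang.org/x/net/html`) would likely be the better choice in practice.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// Deliberately simple patterns; these are assumptions for the sketch and
// will miss malformed or unusual markup.
var (
	titlePattern = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)
	metaPattern  = regexp.MustCompile(`(?is)<meta\s[^>]*>`)
	attrPattern  = regexp.MustCompile(`(?is)([a-z][a-z0-9:_-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')`)
)

// Collection is a hypothetical container for a generated collection.
type Collection struct {
	URL   string
	Title string
	Meta  map[string]string
}

// String gives a short human-readable summary of the collection.
func (c Collection) String() string {
	return fmt.Sprintf("%q (%s), %d meta tags", c.Title, c.URL, len(c.Meta))
}

// CollectionFromHTML uses the page title as the collection title and copies
// every <meta> tag that has a name (or property) and a content attribute.
func CollectionFromHTML(pageURL, body string) Collection {
	c := Collection{URL: pageURL, Meta: map[string]string{}}
	if m := titlePattern.FindStringSubmatch(body); m != nil {
		// Unescape entities and collapse runs of whitespace.
		c.Title = strings.Join(strings.Fields(html.UnescapeString(m[1])), " ")
	}
	for _, tag := range metaPattern.FindAllString(body, -1) {
		attrs := map[string]string{}
		for _, a := range attrPattern.FindAllStringSubmatch(tag, -1) {
			// Only one of a[2] (double-quoted) or a[3] (single-quoted) is set.
			attrs[strings.ToLower(a[1])] = html.UnescapeString(a[2] + a[3])
		}
		name := attrs["name"]
		if name == "" {
			name = attrs["property"]
		}
		if content, ok := attrs["content"]; ok && name != "" {
			c.Meta[strings.ToLower(name)] = content
		}
	}
	return c
}

func main() {
	// Hypothetical page body, not taken from a real crawl.
	body := `<html><head><title>Solar Data</title>
<meta name="description" content="GIS data sets"></head></html>`
	fmt.Println(CollectionFromHTML("http://example.com/data.html", body))
}
```

Which meta tags are actually worth keeping is left open here; the sketch copies all of them so that decision can be made later.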

I'm not totally sure how to pull this off. It may be as simple as looking for more than 10 direct links to content URLs. Some of this thinking should be driven by analyzing already-crawled content.
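To make the "more than 10 direct links" idea concrete, here is a minimal sketch of that heuristic. Everything in it is an assumption for illustration: the function names `ContentLinks` and `IsCollectionCandidate` are hypothetical, "content URL" is approximated by a hand-picked list of file extensions, and links are found with a simple regexp rather than a real HTML parser. The extension list and the threshold are exactly the kind of thing that analysis of already-crawled content should tune.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"regexp"
	"strings"
)

// contentExts is an assumed set of extensions treated as "content".
var contentExts = map[string]bool{
	".zip": true, ".csv": true, ".pdf": true, ".xls": true,
	".xlsx": true, ".json": true, ".xml": true,
}

// hrefPattern is a deliberately simple anchor matcher (quoted hrefs only).
var hrefPattern = regexp.MustCompile(`(?i)<a\s[^>]*?href\s*=\s*["']([^"']+)["']`)

// ContentLinks returns the de-duplicated absolute URLs linked from body
// whose path ends in one of the assumed content extensions.
func ContentLinks(pageURL, body string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse page URL: %w", err)
	}
	var links []string
	seen := map[string]bool{}
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(strings.TrimSpace(m[1]))
		if err != nil {
			continue // skip hrefs that are not valid URLs
		}
		abs := base.ResolveReference(ref)
		if !contentExts[strings.ToLower(path.Ext(abs.Path))] {
			continue
		}
		if s := abs.String(); !seen[s] {
			seen[s] = true
			links = append(links, s)
		}
	}
	return links, nil
}

// IsCollectionCandidate reports whether the page links to more than
// minLinks distinct content URLs.
func IsCollectionCandidate(pageURL, body string, minLinks int) (bool, error) {
	links, err := ContentLinks(pageURL, body)
	if err != nil {
		return false, err
	}
	return len(links) > minLinks, nil
}

func main() {
	// Hypothetical page with 11 data links and one ordinary link.
	var b strings.Builder
	for i := 0; i < 11; i++ {
		fmt.Fprintf(&b, `<a href="data/file%d.zip">file %d</a>`, i, i)
	}
	b.WriteString(`<a href="about.html">About</a>`)
	ok, err := IsCollectionCandidate("http://example.com/gis/index.html", b.String(), 10)
	fmt.Println(ok, err)
}
```

Counting distinct URLs rather than raw anchors keeps pages that repeat the same download link from being flagged by accident.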