Automatic Collection Generation

A common pattern emerging on government sites is an HTML page with numerous direct links to content URLs, for example: http://www.nrel.gov/gis/data_solar.html

In an ideal world, these pages would automatically generate collections, with metadata attributed to each collection based on the HTML content (page title as collection title, meta tags scrutinized and added, etc.).

I'm not totally sure how to pull this off. It may be as simple as looking for more than 10 direct links to content URLs. Some of this thinking should be driven by analyzing already-crawled content.
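To make the "more than 10 direct links" idea concrete, here is a rough sketch of the heuristic in Python, using only the standard library. It is not how sentry does (or should) implement this. The names `CollectionCandidateParser` and `propose_collection` are hypothetical, and treating a URL as a "content URL" when its path ends in one of a fixed set of file extensions is an assumption that would need checking against already-crawled content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumption: file extensions that mark a link as a "content URL".
# This list is illustrative and should be tuned from crawled data.
CONTENT_EXTENSIONS = {".zip", ".csv", ".pdf", ".xls", ".xlsx", ".json", ".xml", ".kml", ".kmz", ".gz"}

# The issue suggests "more than 10" direct links as a starting threshold.
MIN_CONTENT_LINKS = 10


class CollectionCandidateParser(HTMLParser):
    """Collects the page title, named meta tags, and links to content URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}
        self.content_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Keep every named meta tag; deciding which ones matter comes later.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL before classifying.
            url = urljoin(self.base_url, attrs["href"])
            ext = posixpath.splitext(urlparse(url).path)[1].lower()
            if ext in CONTENT_EXTENSIONS and url not in self.content_urls:
                self.content_urls.append(url)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def propose_collection(html, base_url, min_links=MIN_CONTENT_LINKS):
    """Return a collection proposal, or None if the page has too few content links."""
    parser = CollectionCandidateParser(base_url)
    parser.feed(html)
    parser.close()
    if len(parser.content_urls) <= min_links:
        return None
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "urls": parser.content_urls,
    }
```

Calling `propose_collection(page_html, page_url)` on a crawled page would return either `None` or a dict with a candidate title, the page's meta tags, and the de-duplicated content URLs. The threshold and extension list are the obvious knobs to adjust once there is real crawl data to compare against.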
