A pattern that keeps emerging on government sites is an HTML page with numerous direct links to content URLs, for example: http://www.nrel.gov/gis/data_solar.html
Ideally, these pages would automatically generate collections, with metadata attributed to each collection from the HTML content (page title as the collection title, meta tags scrutinized and added, etc.).
I'm not sure how best to pull this off; it may be as simple as flagging any page with more than 10 direct links to content URLs. Some of this thinking should be driven by analyzing already-crawled content.
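As a starting point, the heuristic could be sketched roughly as below. This is a minimal, assumption-laden sketch: the link threshold of 10, the list of "content" file extensions, and the idea of identifying content URLs by extension are all placeholders, not a worked-out rule. It uses only the Python standard library's HTMLParser.

```python
from html.parser import HTMLParser

# Assumptions (placeholders, not a settled design):
# - a "content URL" is anything ending in one of these extensions
# - a page becomes a collection candidate above this link count
LINK_THRESHOLD = 10
CONTENT_EXTENSIONS = (".zip", ".csv", ".pdf", ".xls", ".xlsx", ".kmz", ".shp")

class CollectionCandidateParser(HTMLParser):
    """Collects direct content links, the page title, and meta tags."""

    def __init__(self):
        super().__init__()
        self.content_links = []
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href", "")
            if href.lower().endswith(CONTENT_EXTENSIONS):
                self.content_links.append(href)
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            # candidate collection metadata (description, keywords, ...)
            self.meta[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def analyze(html):
    parser = CollectionCandidateParser()
    parser.feed(html)
    return {
        "is_collection": len(parser.content_links) > LINK_THRESHOLD,
        "title": parser.title,            # candidate collection title
        "meta": parser.meta,              # candidate collection metadata
        "content_links": parser.content_links,
    }
```

Running `analyze()` over already-crawled pages would give a quick read on how often this naive threshold fires, and on what the false positives look like (navigation-heavy index pages, sitemaps, etc.), before investing in anything smarter.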