Open edsu opened 6 years ago
Another (complementary?) angle on this could be the GDELT project's list of frontpage stories for 50,000 online news sites:
https://blog.gdeltproject.org/announcing-gdelt-global-frontpage-graph-gfg/
I'm assuming that Internet Archive's coverage of these 50k sites is probably pretty good, but it would be interesting to see. Even if a page is in Internet Archive, I think there's value in building scoped/thematic collections. And like above, being able to filter out these sites could be a way of discovering links to things that are unlikely to be archived.
@atomotic suggested that Cisco's top DNS list could be useful too.
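One way to test the coverage assumption would be to ask the Wayback Machine availability API about each URL. The following is only a rough sketch, not something the project does today: the function names are made up for illustration, and it assumes the JSON response has the `archived_snapshots` / `closest` shape that the public endpoint documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url):
    """Build the availability query for a single URL (hypothetical helper)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed response, or None.

    Assumes the response looks like:
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def is_archived(url, timeout=10):
    """Make one network request and report whether a snapshot exists."""
    with urlopen(availability_url(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp)) is not None
```

Running `is_archived` over a sample of the 50k sites' URLs would give a rough coverage number, though checking one URL at a time is slow and should be rate limited.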
Via @tjowens:
It could be useful to look at the URLs through a filter of the Alexa Top 500. This would allow a curator to look for content that might not otherwise be picked up by a web archiving service like Internet Archive.
The only caveat here is that if curators don't look at the other content, they could be missing out. For example, there are lots of Twitter, YouTube and Instagram URLs that will not be archived, even though those websites are in the Alexa Top 500. Also, lots of copies keeps stuff safe: maybe it's OK for more web archives to have a copy of something. It's worked pretty well for libraries...
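A minimal sketch of what such a filter might look like, assuming the top-sites list has already been loaded as a set of bare domains. `TOP_SITES` here is a tiny hypothetical stand-in, not the real Alexa list, and the function names are illustrative. It splits the URLs into two buckets rather than discarding the popular ones, so a curator can still review the Twitter/YouTube/Instagram links mentioned in the caveat.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for a top-sites list (e.g. a few Alexa Top 500 entries).
TOP_SITES = {"twitter.com", "youtube.com", "instagram.com", "nytimes.com"}


def host_of(url):
    """Lower-cased hostname of a URL, with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_top_sites(url, top_sites=TOP_SITES):
    """True if the URL's host is a listed domain or one of its subdomains."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in top_sites)


def split_by_top_sites(urls, top_sites=TOP_SITES):
    """Partition URLs into (popular, long_tail), preserving input order."""
    popular, long_tail = [], []
    for url in urls:
        (popular if in_top_sites(url, top_sites) else long_tail).append(url)
    return popular, long_tail


popular, long_tail = split_by_top_sites([
    "https://www.youtube.com/watch?v=abc",
    "https://mobile.twitter.com/edsu/status/1",
    "https://example.org/local-news/story.html",
])
```

Here `long_tail` holds the links less likely to be picked up elsewhere, while `popular` stays available for a second look.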