DocNow / docnow

A Twitter data collection and appraisal application.
MIT License

Alexa Top 500 filter #66

Open edsu opened 6 years ago

edsu commented 6 years ago

Via @tjowens:

It could be useful to look at the URLs through a filter of the Alexa Top 500. This would allow a curator to look for content that might not otherwise be picked up by a web archiving service like Internet Archive.

The only caveat here is that if curators don't look at the other content they could be missing out. For example, there are lots of Twitter, YouTube and Instagram URLs that will not be archived, even though those websites are in the Alexa Top 500. Also, lots of copies keeps stuff safe: maybe it's OK for more web archives to have a copy of something. It's worked pretty well for libraries...
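
A rough sketch of what such a filter might look like, in Python for illustration only (not docnow's actual code). It assumes the top-sites list is already available as a set of bare domains; the names `host_of`, `on_top_site` and `split_by_top_sites` are hypothetical. Rather than dropping URLs, it splits them into two groups so that the top-site URLs remain available for review:

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def on_top_site(url, top_domains):
    """True if the URL's host is a listed domain or a subdomain of one."""
    parts = host_of(url).split(".")
    # Check the host and each parent domain, e.g. m.youtube.com, youtube.com, com.
    return any(".".join(parts[i:]) in top_domains for i in range(len(parts)))


def split_by_top_sites(urls, top_domains):
    """Split URLs into (on a top site, everything else), preserving order."""
    top, other = [], []
    for url in urls:
        (top if on_top_site(url, top_domains) else other).append(url)
    return top, other


# Hypothetical sample data; a real list would be loaded from a ranking file.
top_domains = {"youtube.com", "twitter.com"}
urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.org/post/1",
]
top_urls, other_urls = split_by_top_sites(urls, top_domains)
```

Matching on parent domains is a deliberate simplification: it treats every subdomain of a listed site as part of that site, which may be too coarse for hosting platforms that give each user a subdomain.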

edsu commented 6 years ago

Another (complementary?) angle on this could be the GDELT project's list of frontpage stories for 50,000 online news sites:

https://blog.gdeltproject.org/announcing-gdelt-global-frontpage-graph-gfg/

I'm assuming that Internet Archive's coverage of these 50k is probably pretty good. But it would be interesting to see. Even if the page is in Internet Archive I think there's value in building scoped/thematic collections. And like above, being able to filter out these sites could be a way of discovering links to things that are unlikely to be archived.

edsu commented 6 years ago

@atomotic suggested that Cisco's top DNS list could be useful too.