Open micahstubbs opened 7 years ago
This page-level graph is based on Common Crawl. It's a wee bit out of date, but the index/arc data format looks easy to parse, and the total data is only ~20GB: http://webdatacommons.org/hyperlinkgraph/
Common Crawl has an index now. For example, this query returns JSON listing the URLs they have for bl.ocks.org:
http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=bl.ocks.org/*&output=json&page=0
However, it doesn't have many URLs :( so I went through a whole bunch of past crawls and … still only got 537 usernames for bl.ocks.org and 5 for blockbuilder.org. So Common Crawl is probably not that useful for discovery. (It may still be useful for ranking, but that's data I haven't looked into, as it's much more work.)
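For anyone who wants to repeat this, here is a rough sketch of pulling usernames out of one index page. It is not the script used above. It assumes the index returns one JSON object per line with a `url` field, and that the first path segment of a bl.ocks.org URL is the username (anonymous gist URLs of the form `bl.ocks.org/<gist id>` would be miscounted). The function names `fetch_index_lines` and `usernames_from_index` are made up for this example.

```python
import json
import urllib.request
from urllib.parse import urlencode, urlparse


def fetch_index_lines(index="CC-MAIN-2017-34-index", pattern="bl.ocks.org/*", page=0):
    """Download one page of index results as a list of JSON lines (needs network)."""
    query = urlencode({"url": pattern, "output": "json", "page": page})
    url = "http://index.commoncrawl.org/{}?{}".format(index, query)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8").splitlines()


def usernames_from_index(lines, host="bl.ocks.org"):
    """Collect the first path segment of every URL on `host`.

    Assumption: URLs look like bl.ocks.org/<username>/<gist id>.
    """
    users = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        parsed = urlparse(record.get("url", ""))
        if parsed.hostname != host:
            continue
        segments = [s for s in parsed.path.split("/") if s]
        if segments:
            # GitHub usernames are case-insensitive, so normalise to lowercase.
            users.add(segments[0].lower())
    return users
```

Calling `usernames_from_index(fetch_index_lines())` would cover a single page of a single crawl; covering past crawls means repeating that for each index name and page number and taking the union of the sets.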
@redblobgames nice! Thanks for this research. I'll diff the usernames found via Common Crawl against the current list of D3 block-making GitHub users we know about.
said current list https://github.com/enjalot/blockbuilder-search-index/blob/master/data/usables.csv
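The diff itself could look something like the sketch below. I haven't checked the layout of usables.csv, so the `username` column name is a guess and is passed in as a parameter; `new_usernames` is a made-up helper name.

```python
import csv
import io


def new_usernames(found, known_csv_text, column="username"):
    """Return the usernames in `found` that are missing from the known CSV.

    Assumption: the CSV has a header row and `column` names the username
    column; adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(known_csv_text))
    known = {row[column].strip().lower() for row in reader if row.get(column)}
    # Compare in lowercase because GitHub usernames are case-insensitive.
    return sorted({name.lower() for name in found} - known)
```

Anything this returns is a candidate user the search index doesn't know about yet.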
http://commoncrawl.org/ h/t @redblobgames for this idea
could also possibly use links in this data as a search ranking score component