enjalot / blockbuilder-search-index

download and process d3.js blocks for further indexing and visualization
BSD 3-Clause "New" or "Revised" License

investigate finding d3 blocks in commoncrawl dataset #33

Open micahstubbs opened 7 years ago

micahstubbs commented 7 years ago

http://commoncrawl.org/ h/t @redblobgames for this idea

could also possibly use links in this data as a search ranking score component

redblobgames commented 7 years ago

This page-level graph is based on Common Crawl. It's a wee bit out of date, but the index/arc data format looks easy to parse, and the total data is only ~20GB: http://webdatacommons.org/hyperlinkgraph/

redblobgames commented 7 years ago

Common Crawl has an index now. For example, here's json listing urls they have for bl.ocks.org:

http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=bl.ocks.org/*&output=json&page=0

That listing has some URLs, but not many :( so I went through a whole bunch of past crawls and … still only got 537 usernames for bl.ocks.org and 5 for blockbuilder.org. So Common Crawl is probably not that useful for discovery. (It may still be useful for ranking, but that's data I haven't looked into, as it's much more work.)

blocks-usernames.txt
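For anyone wanting to reproduce a username list like this, here is a minimal sketch of how it might be done. It is not the script actually used above. It assumes the index endpoint returns one JSON object per line with a `url` field (as the `output=json` query above does) and that the first path segment of a bl.ocks.org URL is a username, which won't hold for every URL (some paths start with a gist id). The function names `fetch_cdx_lines` and `usernames_from_cdx_lines` are made up for illustration.

```python
import json
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen


def fetch_cdx_lines(index="CC-MAIN-2017-34-index", pattern="bl.ocks.org/*", page=0):
    """Fetch one page of index results as a list of JSON lines.

    Hypothetical helper: it builds the same kind of query URL as the
    example in this thread. Not called at import time, since it needs
    network access.
    """
    query = urlencode({"url": pattern, "output": "json", "page": page})
    with urlopen(f"http://index.commoncrawl.org/{index}?{query}") as resp:
        return resp.read().decode("utf-8").splitlines()


def usernames_from_cdx_lines(lines, host="bl.ocks.org"):
    """Collect the first path segment of every URL record on `host`.

    Assumption: the first path segment is a username. Paths that start
    with a gist id will produce false positives.
    """
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        parsed = urlparse(record.get("url", ""))
        if parsed.hostname != host:
            continue  # ignore records for other hosts
        segments = [s for s in parsed.path.split("/") if s]
        if segments:
            names.add(segments[0])
    return names


# Example (requires network access):
#   names = usernames_from_cdx_lines(fetch_cdx_lines())
```

Looping `fetch_cdx_lines` over several crawl index names and pages, then taking the union of the resulting sets, would approximate the "whole bunch of past crawls" pass described above.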

micahstubbs commented 7 years ago

@redblobgames nice! thanks for this research. I'll diff those usernames found via common crawl with the current list of d3 block-making github users we know about.

said current list https://github.com/enjalot/blockbuilder-search-index/blob/master/data/usables.csv
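A rough sketch of what that diff could look like, assuming blocks-usernames.txt has one username per line and that usables.csv has a header row with a username column. The column name `username` is a guess, not checked against the actual file, so it is a parameter here. Names are compared case-insensitively since GitHub usernames are case-insensitive.

```python
import csv
import io


def read_username_list(fileobj):
    """Read one username per line, ignoring blank lines."""
    return {line.strip() for line in fileobj if line.strip()}


def read_csv_usernames(fileobj, column="username"):
    """Read usernames from a CSV with a header row.

    Assumption: `column` names the username column; adjust it to match
    the real header in usables.csv.
    """
    reader = csv.DictReader(fileobj)
    return {row[column].strip() for row in reader if row.get(column)}


def new_usernames(crawled, known):
    """Return crawled usernames not already known, compared case-insensitively."""
    known_lower = {name.lower() for name in known}
    return sorted(name for name in crawled if name.lower() not in known_lower)


# Example with in-memory data standing in for the two files:
crawled = read_username_list(io.StringIO("mbostock\nNewUser\n\nenjalot\n"))
known = read_csv_usernames(io.StringIO("username\nmbostock\nEnjalot\n"))
print(new_usernames(crawled, known))
```

With real files, the `io.StringIO` objects would be replaced by `open("blocks-usernames.txt")` and `open("data/usables.csv", newline="")`.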