I currently do not have the capacity to hire full time; however, I do intend to hire someone to help build infrastructure related to CommonCrawl. All Gitcoin bounties are currently on hold. When I have time to invest further in this project, I will discuss a full-time DevOps developer role for it. All payment will be done in DAI, and resource allocation will be approximately 5k/mo.
An Electron-based interface that works with a Go server will be available.
Install as a dependency:
go get github.com/ChrisCates/CommonCrawler
Access the library functions by importing it:
import (
    cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
    // Exported Go identifiers must be capitalized to be callable from outside the package
    cc.Scan()
    cc.Download()
    cc.Extract()
    // And so forth
}
Install from source:
go install github.com/ChrisCates/CommonCrawler
Or you can curl it from GitHub:
curl https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
Then run as a binary:
# Output help
commoncrawler --help
# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths
# Start crawling the web
commoncrawler start --stop -1
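The `--start`/`--stop` flags above appear to select a range of lines from `wet.paths`, with `-1` meaning "all remaining files". Below is a minimal, self-contained sketch of that selection logic; the function name `selectPaths` and its exact semantics are illustrative assumptions, not the crawler's actual code:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// selectPaths returns the WET file paths from start (inclusive) to stop
// (exclusive); stop == -1 selects everything from start onward.
func selectPaths(wetPaths string, start, stop int) []string {
	var all []string
	sc := bufio.NewScanner(strings.NewReader(wetPaths))
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			all = append(all, line)
		}
	}
	if stop == -1 || stop > len(all) {
		stop = len(all)
	}
	if start < 0 || start >= stop {
		return nil
	}
	return all[start:stop]
}

func main() {
	paths := "crawl-data/a.warc.wet.gz\ncrawl-data/b.warc.wet.gz\ncrawl-data/c.warc.wet.gz\n"
	fmt.Println(selectPaths(paths, 0, 2)) // first two WET files
}
```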
If building from source, fetch the aurora dependency (used for colored terminal output):

go get github.com/logrusorgru/aurora
First configure the type of data you want to extract.
// Config holds the preset variables for your extractor
type Config struct {
    baseURI     string
    wetPaths    string
    dataFolder  string
    matchFolder string
    start       int
    stop        int
}
// Defaults
Config{
    start:       0,
    stop:        5,
    baseURI:     "https://commoncrawl.s3.amazonaws.com/",
    wetPaths:    path.Join(cwd, "wet.paths"),
    dataFolder:  path.Join(cwd, "output/crawl-data"),
    matchFolder: path.Join(cwd, "output/match-data"),
}
Build and run with Docker:

docker build -t commoncrawler .
docker run commoncrawler
Or build and run the binary yourself:

go build -i -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler
Or you can simply run it:
go run src/*.go
MIT Licensed
If people are interested or need it, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca
You can post issues if they are valid, and I may fund them based on priority.