commoncrawl / cc-index-server

Common Crawl Index Server
http://index.commoncrawl.org/

[PyWB2] Remove "source" and "source-coll" fields from results #7

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

With PyWB 2.x, every result record contains two extra fields, "source" and "source-coll", that are absent from the original index, e.g.

{
  "url": "http://commoncrawl.org/",
  "mime": "text/html",
  "mime-detected": "text/html",
  "status": "200",
  "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
  "length": "5413",
  "offset": "42695747",
  "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
  "charset": "UTF-8",
  "languages": "eng",
  "source": "CC-MAIN-2019-35/indexes/cluster.idx",
  "source-coll": "CC-MAIN-2019-35"
}

This is redundant because the collection (aka "source") is explicitly queried, and it means 20% more content with Content-Encoding "identity" (which most requests use). The 20% matters, given that the index server answers 10 million requests per month, sending multiple TiB of results.
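The overhead can be checked roughly against the example record above. This is only a sketch: it assumes each result is sent as one line of JSON serialized the way Python's `json.dumps` does by default, and the helper names `strip_source_fields` and `encoded_size` are made up for illustration, not part of PyWB.

```python
import json

# Example record copied from the issue text above.
RECORD = {
    "url": "http://commoncrawl.org/",
    "mime": "text/html",
    "mime-detected": "text/html",
    "status": "200",
    "digest": "FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT",
    "length": "5413",
    "offset": "42695747",
    "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313617.6/warc/"
                "CC-MAIN-20190818042813-20190818064813-00014.warc.gz",
    "charset": "UTF-8",
    "languages": "eng",
    "source": "CC-MAIN-2019-35/indexes/cluster.idx",
    "source-coll": "CC-MAIN-2019-35",
}

# The two fields that PyWB 2.x adds to every result record.
EXTRA_FIELDS = ("source", "source-coll")


def strip_source_fields(record):
    """Return a copy of the record without the PyWB-added fields (hypothetical helper)."""
    return {k: v for k, v in record.items() if k not in EXTRA_FIELDS}


def encoded_size(record):
    """Bytes needed for the record as one line of UTF-8 JSON (assumed wire format)."""
    return len((json.dumps(record) + "\n").encode("utf-8"))


with_source = encoded_size(RECORD)
without_source = encoded_size(strip_source_fields(RECORD))
overhead = (with_source - without_source) / without_source

print(f"{with_source} bytes with, {without_source} bytes without, "
      f"{overhead:.1%} overhead")
```

The exact percentage depends on the URL and filename lengths of each record, so a single record only indicates the order of magnitude.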

Note: BaseAggregator has a nosource param; it must either be passed permanently or be made configurable in config.yaml.

sebastian-nagel commented 3 years ago

Addressed in commoncrawl/pywb@00a84c9