commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Investigate using CloudFlare's zlib #13

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

Decompressing the WARC files from Common Crawl is a relatively slow step in the indexing process. It would be great to see how much improvement CloudFlare's version of zlib can bring.

Some benchmarks here: http://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/
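A rough way to get comparable numbers locally is to time gzip decompression through Python's `zlib` module and run the same script against an interpreter linked to each zlib build. This is only a sketch: the helper name `decompress_throughput` and the synthetic payload are assumptions for illustration, and a real measurement should use actual WARC files from Common Crawl.

```python
import gzip
import time
import zlib


def decompress_throughput(payload, repeats=5):
    """Return the best observed decompression rate in MB/s of output."""
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        data = zlib.decompress(payload, 16 + zlib.MAX_WBITS)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        size = len(data)
    # Guard against a zero reading from the timer on very small inputs.
    return size / max(best, 1e-9) / 1e6


# Synthetic stand-in for a WARC payload (assumption, not real crawl data).
record = b"WARC/1.0\r\nContent-Type: text/html\r\n\r\n<html></html>\r\n"
sample = gzip.compress(record * 20000)

print("zlib runtime version:", zlib.ZLIB_RUNTIME_VERSION)
print("%.1f MB/s" % decompress_throughput(sample))
```

Printing `zlib.ZLIB_RUNTIME_VERSION` helps confirm which library the interpreter actually loaded before trusting the numbers.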

We should keep the ability to fall back on the regular implementation.
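One possible shape for that fallback is an import guard that prefers a faster zlib-compatible module and silently uses the standard library otherwise. The module name `cfzlib` below is hypothetical: no such binding is referenced in this issue, and it stands in for whatever wrapper around CloudFlare's zlib ends up being built.

```python
import importlib
import zlib as stdlib_zlib


def load_zlib(preferred="cfzlib"):
    """Return the preferred zlib-compatible module, or the stdlib zlib.

    `preferred` is a hypothetical module name used for illustration only.
    """
    try:
        return importlib.import_module(preferred)
    except ImportError:
        # The faster build is not installed; use the regular implementation.
        return stdlib_zlib


zlib_impl = load_zlib()
print("using", zlib_impl.__name__)
```

Code that decompresses would then call `zlib_impl.decompressobj(...)` instead of importing `zlib` directly, so the choice stays in one place.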

To do this, we may have to fork commoncrawl/gzipstream, which is where zlib is imported: https://github.com/commoncrawl/gzipstream/blob/master/gzipstream/gzipstreamfile.py#L2
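To illustrate what a fork would need to parameterize, here is a minimal sketch of streaming decompression over concatenated gzip members, which is how Common Crawl WARC files are laid out. This is not gzipstream's actual code: the function `iter_decompressed` and its `zlib_module` argument are assumptions showing how the zlib implementation could be injected rather than imported at module level.

```python
import gzip
import io
import zlib


def iter_decompressed(fileobj, zlib_module=zlib, chunk_size=64 * 1024):
    """Yield decompressed chunks from a stream of concatenated gzip members."""
    wbits = 16 + zlib_module.MAX_WBITS  # expect gzip framing
    decomp = zlib_module.decompressobj(wbits)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            out = decomp.decompress(chunk)
            if out:
                yield out
            if decomp.eof:
                # This member is finished; leftover bytes start the next one.
                chunk = decomp.unused_data
                decomp = zlib_module.decompressobj(wbits)
            else:
                # The whole chunk was consumed by the current member.
                chunk = b""


# Two gzip members back to back, similar to two WARC records.
stream = io.BytesIO(gzip.compress(b"record one\n") + gzip.compress(b"record two\n"))
print(b"".join(iter_decompressed(stream)))
```

Because the module is passed in as an argument, the regular `zlib` remains the default and an alternative build only has to expose `decompressobj` and `MAX_WBITS`.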

If the improvement is significant, we should use it and add the build commands to our Dockerfile.