-
https://groups.google.com/forum/?hl=en#!msg/common-crawl/0fYTJtFD6Fs/qblUucr9BgAJ;context-place=forum/common-crawl
"You should be using ranged requests to avoid downloading the whole archive. Try this…
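The rest of the quoted reply is cut off above, so what follows is not the original poster's code. It is only a minimal Python sketch of the general idea of a ranged request, assuming the byte offset and length of one gzip-compressed record are already known (for example from an index lookup). `range_header` and `decompress_member` are hypothetical helper names.

```python
import gzip
import io


def range_header(offset, length):
    """Build an HTTP Range header covering `length` bytes starting at `offset`."""
    # Range end positions are inclusive, hence the -1.
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def decompress_member(raw):
    """Decompress a single gzip member, such as the body of a ranged response."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
        return gz.read()
```

The returned dictionary can be passed as request headers to whichever HTTP client is in use. The response body would then be just that one compressed record instead of the whole archive.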
-
I get an error when trying to install the requirements with `pip install -r requirements.txt`. I'm highlighting interesting bits in bold. I'm using Ubuntu 14.04. Thanks in advance for your help.
R…
-
see [https://github.com/commoncrawl/nutch/blob/3551eb6dbb7f7152a13d2e4eb0f8eb6014dc8252/src/java/org/commoncrawl/tools/WarcExport.java#L140]
-
It reads them, but the data remains compressed, thus defeating line iteration.
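Which reader "it" refers to is not stated here. As a general illustration only, one way to get line iteration back in Python is to wrap the compressed binary stream with the standard `gzip` module before iterating. This sketch assumes the data is gzip-compressed text. `iter_lines` is a hypothetical helper name.

```python
import gzip
import io


def iter_lines(compressed_stream, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed binary stream."""
    # gzip.open accepts an already-open binary file object as well as a path;
    # mode "rt" decompresses and decodes to text on the fly.
    with gzip.open(compressed_stream, mode="rt", encoding=encoding) as text:
        for line in text:
            yield line.rstrip("\n")


# Usage with an in-memory stream standing in for a real file or response body.
sample = io.BytesIO(gzip.compress(b"first line\nsecond line\n"))
lines = list(iter_lines(sample))
```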
-
I am trying to work with a Common Crawl file and came across Crate.io.
I am following the steps given in the instructions section.
I am having trouble with this command:
`COPY commoncrawl F…
-
When writing the sequence file (at the very end of the job), dosample.py sometimes fails with an encoding error:
```
+ python dosample.py --verbose --shards=300 --splitfile=s3a://cc-cdx-index/2014-49…
```
-
This [spec](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#output-format-json) describes the JSON output of a CDX server as an array of arrays.
pywb cdx-server ret…
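For reference, a small Python sketch of consuming the array-of-arrays form from the linked spec. It assumes the first inner array is a header row of field names, as that spec describes, and it does not cover whatever pywb returns, since that part of the report is cut off. `cdx_rows_to_dicts` is a hypothetical helper name and the sample payload is made up for illustration.

```python
import json


def cdx_rows_to_dicts(payload):
    """Turn an array-of-arrays CDX JSON payload into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    # Assumption: the first row holds the field names, the rest are records.
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


# Made-up sample payload in the array-of-arrays shape.
sample = '[["urlkey", "timestamp"], ["org,example)/", "20140101000000"]]'
parsed = cdx_rows_to_dicts(sample)
```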
-
The warcinfo record returned by [WARCRecordFormat](../blob/master/src/main/java/com/digitalpebble/stormcrawler/warc/WARCRecordFormat.java) lacks a trailing CRLF, which causes some WARC libraries to fail t…
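To illustrate the kind of check a strict reader might apply, here is a minimal Python sketch. It relies on the WARC format's rule that every record is followed by two CRLF sequences. `ends_with_record_terminator` is a hypothetical helper, not part of StormCrawler or of any WARC library, and a real parser would locate the end of the block from Content-Length rather than by looking at the tail.

```python
CRLF = b"\r\n"


def ends_with_record_terminator(record_bytes):
    """Return True if the serialized record ends with two CRLF sequences."""
    return record_bytes.endswith(CRLF + CRLF)


# Made-up, heavily shortened records for illustration only.
good = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: example\r\n\r\n\r\n"
bad = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: example"
```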
-
CommonCrawl has the [WET files](http://commoncrawl.org/the-data/get-started/), which are WARC files where the HTML responses have been converted to plain text (and non-HTML pages have been removed).
Is it p…