-
[DMOZ](http://www.dmoz.org/) is a good source to get a large list of news sites for various languages and countries:
1. define a set of English categories matching the topic
2. extract the translation…
-
Currently when the flink-crawler job is cancelled, we get a bunch of errors (and other interesting output) in the log (see below). Some comments about this:
1. I think we need to make CommonCrawlFe…
-
Hi, I have some problems getting the MR jobs to run on EMR. The scripts work locally and they work with 1 or 10 WARC files on EMR, but with 100 I always get some failures of the type "PipeMapRed.waitOutpu…
-
This code produces a .cdx and .warc file.
```
ArchiveSpark.load(sc, WarcCdxHdfsSpec(cdxPath = "/data/example.cdx.gz", warcPath = "/data"))
.filter(r => r.surtUrl.startsWith("com,example"))
.saveAs…
-
This code effectively loads the CDX index and gets the WARC pages if needed.
`val rdd: RDD[WarcRecord] = ArchiveSpark.load(sc, WarcCdxHdfsSpec("/data/example.cdx.gz", "/data"))`
I see that defini…
-
In your paper Learned in Translation: Contextualized Word Vectors (McCann et al., 2017), it says
> We used the CommonCrawl-840B GloVe model for English word vectors, which were completely fixed d…
-
My apologies if this is obvious, but I don't know that much about website logs.
I'm using [FastMail](https://www.fastmail.com/) to [serve files](https://www.fastmail.com/help/files/website.html). …
-
Dear helgeho,
The Common Crawl CDX files have a different structure than your library expects.
`au,com,canberratimes)/business/mining-and-resources/bhp-on-the-hunt-for-new-chief-ft-20121106-28x…
-
According to https://github.com/iipc/warc-specifications/issues/23, the standard says that WARC-Target-URI should be surrounded by `<` and `>`, such as in:
`WARC-Target-URI: http://www.archive.org/images/logoc.…
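
A reader that has to cope with both forms of the header (with and without surrounding `<` and `>`) could strip the brackets when they are present. This is a minimal sketch, not taken from any WARC library: the helper name `normalize_target_uri` is hypothetical, and it assumes the header value has already been split from the header name.

```python
def normalize_target_uri(value: str) -> str:
    """Return the URI from a WARC-Target-URI header value.

    Hypothetical helper: drops one pair of surrounding angle brackets
    if both are present, otherwise returns the trimmed value unchanged.
    """
    value = value.strip()
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


# Both spellings of the same header value yield the same URI.
bracketed = normalize_target_uri("<http://example.com/page>")
plain = normalize_target_uri("http://example.com/page")
```

Only a matching pair is removed, so a value with a single stray bracket is left as-is for the caller to reject or log.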
-
[cc-pyspark](/commoncrawl/cc-pyspark) already uses boto3 to download data from s3://commoncrawl/: faster multi-part downloads and fewer errors (timeouts, "503 slow down"). The upgrade should improve th…