-
Compare with what we get from [https://webrecorder.io/]
Read with [https://github.com/ikreymer/webarchiveplayer]
Try with warcdump command
-
As mentioned in commonsearch/cosr-results#2, some big domains are missing from Common Crawl for various reasons that we will try to fix, but we should have a fallback with "fake" documents created fro…
-
Look for app metadata in Common Crawl, such as W3C app manifests and Open Graph tags.
https://commoncrawl.org/the-data/get-started/
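
As a starting point, here is a minimal sketch of what "looking for app metadata" could mean for a single HTML page taken from a crawl record. It only uses Python's standard `html.parser`. The class and function names (`AppMetadataParser`, `extract_app_metadata`) are made up for illustration, and it assumes Open Graph tags appear as `<meta property="og:...">` and the manifest as `<link rel="manifest">`.

```python
from html.parser import HTMLParser


class AppMetadataParser(HTMLParser):
    """Collects Open Graph tags and the web app manifest link (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}       # e.g. {"og:title": "..."}
        self.manifest_href = None  # href of <link rel="manifest">, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            # Keep the first value seen for each og:* property.
            if prop.startswith("og:") and content is not None:
                self.open_graph.setdefault(prop, content)
        elif tag == "link":
            # rel may hold several space-separated tokens.
            rel = (attrs.get("rel") or "").lower().split()
            if "manifest" in rel and self.manifest_href is None:
                self.manifest_href = attrs.get("href")


def extract_app_metadata(html):
    """Return (open_graph_dict, manifest_href) for one HTML document."""
    parser = AppMetadataParser()
    parser.feed(html)
    return parser.open_graph, parser.manifest_href
```

The HTML payload would still have to be pulled out of each WARC response record first; that step is not shown here.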
-
If a gzipped WARC file doesn't contain the extra field, line 84 in `GzipHeader.cs` will fail as `CompressedSize` will be 0 and `br.BaseStream.Position` can't be negative.
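
To illustrate the missing guard, here is a small Python sketch (not the C# code in `GzipHeader.cs`) that checks the `FEXTRA` flag defined in RFC 1952 before trying to read the extra field. The function name `read_gzip_extra` is hypothetical; the header layout (10 fixed bytes, flag byte at offset 3, little-endian 2-byte `XLEN`) comes from the gzip specification.

```python
import gzip
import io
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field follows the fixed header


def read_gzip_extra(stream):
    """Return the raw extra field of the gzip member at the current
    position, or None if the member does not carry one."""
    header = stream.read(10)  # ID1 ID2 CM FLG MTIME(4) XFL OS
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    flags = header[3]
    if not flags & FEXTRA:
        # No extra field: the caller must not assume a compressed size is known.
        return None
    xlen_bytes = stream.read(2)
    if len(xlen_bytes) < 2:
        raise ValueError("truncated gzip header")
    (xlen,) = struct.unpack("<H", xlen_bytes)
    extra = stream.read(xlen)
    if len(extra) < xlen:
        raise ValueError("truncated gzip extra field")
    return extra


# Python's gzip module writes no extra field, so this yields None.
plain = io.BytesIO(gzip.compress(b"WARC/1.0\r\n"))
print(read_gzip_extra(plain))
```

A reader that gets `None` back would need a fallback, such as decompressing the member to find where it ends, instead of seeking by a size that was never recorded.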
-
The WARC file rotation may happen unnecessarily often:
```
% ls -lh /data/warc/
-rw-r--r-- 1 storm storm 983M Sep 28 07:43 CC-NEWS-20160927074341-00000.warc.gz
-rw-r--r-- 1 storm storm 42M Se…
-
## General motivation
Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expres…
-
The [WEATGenerator](https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java) chokes on some WARC files and fails with a StringIndexOutOfBoun…
-
For a feasible list of URLs (at most hundreds of thousands) given as a parameter.
-
The 2015-11 output was actually computed on CC-MAIN-2015-27, so we need to:
- [x] rename output folders in the S3 bucket
- [x] update the readme appropriately
-
When I run
`spark-submit jobs/spark/index.py --warc_limit 1 --only_homepages --profile`
as described in README.md, the following error appears:
16/03/15 07:15:18 INFO BlockManagerMaster: Registere…