-
(reported by Christian Lund on [Common Crawl Google group](https://groups.google.com/d/msg/common-crawl/WML9YcG-Ty8/asA1G0NXBgAJ))
For HTML elements only the attributes `name`, `rel`, `content` and …
-
Currently we don't know how our implementation performs in a larger environment.
Therefore, we want to test it on Microsoft Azure / Docker.
This involves the following steps:
- [x] @MusicConnecti…
-
Is either the Common Crawl data or the script to get the data available anywhere?
Thanks.
-
Hi,
I am trying to run the program below on Windows 7 with **-r local** as the input parameter, and I get the error below. However, if I don't pass -r local, it works fine.
My program
**mrcc.py…
-
It looks like elasticrawl thinks it is one directory above where it actually is since the S3 path change. The listing used to show segments like `1448398462665.97` and `1448398462686.42`, but it now has a single segment,…
-
More details from this CommonCrawl [user group post](https://groups.google.com/forum/#!msg/common-crawl/L4-Sxz_wkTg/y5b_siYlEwAJ)
-
As a cronjob; using s3cmd?
-
The WARC standard recommends compressing every record independently "[record-at-time](https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#record-at-t…
-
The WARC standard recommends compressing every record independently ["record-at-time"](https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#record-at-…
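
As a rough illustration of what per-record compression means in practice, here is a minimal Python sketch using only the standard library. It assumes each record is written as its own gzip member and the members are simply concatenated. The helper names `write_record_at_time` and `read_record_at` and the record bytes are hypothetical placeholders, not part of any Common Crawl tool or of the WARC specification.

```python
import gzip
import io
import zlib


def write_record_at_time(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined bytes and the byte offset where each member starts.
    """
    buf = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(buf.tell())
        buf.write(gzip.compress(record))
    return buf.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # Adding 16 to the window size tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    # Decompression stops at the end of the first member; the bytes of any
    # following members are left in decomp.unused_data.
    return decomp.decompress(blob[offset:])


# Placeholder payloads standing in for real WARC records.
records = [b"placeholder record one\r\n\r\n", b"placeholder record two\r\n\r\n"]
blob, offsets = write_record_at_time(records)
```

With this layout, `gzip.decompress(blob)` still yields all records back to back, while `read_record_at(blob, offsets[1])` returns only the second record without decompressing anything before it. That random access by offset is what per-record compression buys.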
-
So I recently deployed the project on AWS and I was surprised by the low performance of the indexer. I investigated and found out that the Spark indexer only uses 1 of all the available cores (at 100%), why is tha…