-
https://commoncrawl.org/
> We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
I'm not sure how much data it is, but certainly a few TB.
-
### Data Owner Name
Mongo2Stor
### What is your role related to the dataset
Data Preparer
### Data Owner Country/Region
United States
### Data Owner Industry
Not-for-Profit
### Website
[https://da…
-
Hi lena-voita and RachitBansal,
I am trying to reproduce the experiment using the *WMT2018 (the Yandex corpus, EN-RU)*. However, the result I got wasn't very satisfying.
I guess I might have …
-
http://commoncrawl.org/2016/10/news-dataset-available/
We should make sure it works with the current [common crawl source](https://github.com/commonsearch/cosr-back/blob/master/cosrlib/sources/common…
-
I'm not sure whether Python does a DNS resolve for every `urlopen` call or not. I noticed that `data.commoncrawl.org` returns multiple IPs, so we could spread the load over multiple cloudfront servers…
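A minimal sketch of how one might check which addresses a hostname resolves to and rotate through them. It uses only the standard library's `socket.getaddrinfo`. The helper names `resolve_all` and `round_robin` are hypothetical, introduced here for illustration. The `resolver` parameter is also an assumption, added so the lookup can be swapped out offline. This does not settle whether `urlopen` re-resolves on every call; it only shows what the resolver returns at one moment.

```python
import itertools
import socket


def resolve_all(host, port=443, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses that `host` resolves to.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    lookup can be replaced in tests (an assumption of this sketch).
    """
    infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({info[4][0] for info in infos})


def round_robin(addresses):
    """Yield the given addresses in order, repeating forever."""
    return itertools.cycle(addresses)


if __name__ == "__main__":
    # Performs a real DNS lookup; the addresses returned vary over time.
    ips = resolve_all("data.commoncrawl.org")
    print(ips)
```

Connecting to a chosen IP directly over HTTPS would still need the original hostname for TLS and the `Host` header, so this sketch stops at picking an address.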
-
Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the sessi…
-
### Description
When trying to execute the CommonCrawl generator script (`tensor2tensor/data_generators/wikisum/get_references_commoncrawl.py`), I run into several issues that point towards compatibi…
-
Love that you pull from Wayback. There are more datasets you could use too: complementing Wayback with resources such as http://commoncrawl.org may yield more results.
-
EDIT: this helped, the doc may need to be updated:
```
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
```
**Describe the bug**
According to the docs, `aut` should be able to r…
-
Great work fixing this in 5.3.0, mate.
Importing EN now, do you know of other feeds people use with it?
Have you ever thought about doing something like this with the CommonCrawl?