-
On line 102 of commoncrawl/src/main/java/org/commoncrawl/util/shared/ARCFileReader.java, there's a comment that says the constructor is private (it's actually public), and refers to the "factory metho…
-
If an error occurs, the index server responds with HTTP status code 200 OK, but it should return a 503 or another 5xx error. Seen with:
- a call to a non-existent API endpoint (collection) returns 200 and an empty result
…
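
Until the server reports errors through the status code, a client can guard against the case above. This is a minimal sketch, not part of the index server or any client library: `IndexServerError` and `check_index_response` are hypothetical names, and treating every empty 200 response as an error is an assumption that will also flag legitimate queries with no matches.

```python
from typing import List


class IndexServerError(RuntimeError):
    """Raised when a response looks like a masked server-side error (hypothetical)."""


def check_index_response(status: int, body: str) -> List[str]:
    """Return the non-empty result lines, or raise if the response looks like an error."""
    if status != 200:
        raise IndexServerError(f"index server returned HTTP {status}")
    # Keep only lines that carry content; blank lines are not records.
    lines = [line for line in body.splitlines() if line.strip()]
    if not lines:
        # Assumption: a 200 with no records may hide an error,
        # such as a request for a non-existent collection.
        raise IndexServerError("HTTP 200 with an empty result")
    return lines
```

The caller passes the status code and body text from whatever HTTP client it already uses, so the check itself needs no network access.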
-
Re a comment by @davclark on email ("If folks have already-developed datasets that are amenable to a range of text processing, please let me know!"):
- See https://www.courtlistener.com/, especially h…
-
I am trying to copy a file in text mode, but it is not working. The URL is com.wordpress.alinebessa/2011/06/11/documenting-accerciser-first-impressions/:http
which exists in CommonCrawl. When I check…
-
EDIT: this helped with `Wrong FS`, more tickets incoming ;)
```scala
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
```
Hey folks, I'm trying to read some common crawl data from S3...…
-
2024-02-14 21:01 INFO 2048692:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2020-10_0278.tsv.gz [200] took 8s (5766.4kB/s)
2024-02-14 21:01 INFO 2048692:root - Starting downl…
-
Here we combine all the datasets we can collect:
- [OSCAR's CommonCrawl Dataset](https://traces1.inria.fr/oscar/)
- [Arabic BERT Corpus](https://www.kaggle.com/abedkhooli/arabic-bert-corpus)
- [Hi…
-
Hi,
I am trying to implement this:
```bash
aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2021
```
It is a public AWS dataset, so no authentication is required. I try to work around that issue…
-
If a field requested by the `fl` parameter is missing in one of the records, the query processing exits with an exception and the result list is truncated:
```
Traceback (most recent call last):
…
-
- [ ] Check if `get_text(link) is None` and `link in seed_urls`:
  - [ ] if so, add the flag `inactive`;
  - [ ] otherwise, add the flag `active`.
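
The checklist above could be sketched roughly as follows. Only `get_text` and `seed_urls` appear in the task. `flag_links`, its parameters, and the string flags are assumptions made for illustration, and `get_text` is passed in as a callable because its real signature is not shown here.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def flag_links(
    links: Iterable[str],
    seed_urls: Set[str],
    get_text: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Flag each link as 'inactive' or 'active' (hypothetical helper)."""
    flags: Dict[str, str] = {}
    for link in links:
        # Inactive only when no text could be extracted AND the link is a seed URL.
        if get_text(link) is None and link in seed_urls:
            flags[link] = "inactive"
        else:
            flags[link] = "active"
    return flags
```

Read literally, the rule marks a link with no text as `active` when it is not a seed URL; whether that is intended is worth confirming before implementing.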