-
Would you consider offering the same download method as Common Crawl?
-
Re a comment by @davclark on email ("If folks have already-developed datasets that are amenable to a range of text processing, please let me know!"):
- See https://www.courtlistener.com/, especially h…
-
2024-02-14 21:01 INFO 2048692:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2020-10_0278.tsv.gz [200] took 8s (5766.4kB/s)
2024-02-14 21:01 INFO 2048692:root - Starting downl…
-
When I run `python -m cc_net` to download and extract the data, the connection cannot be opened:
`requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.comm…
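A 503 is often transient, so one common workaround is to retry with exponential backoff. The following is only a minimal sketch, not part of `cc_net`: `fetch_with_retry`, its parameters, and the `fetch` callable (anything that returns a status code and a body) are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Tuple

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = {500, 502, 503, 504}


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns HTTP 200, backing off between tries.

    fetch is a hypothetical callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        last_try = attempt == max_attempts - 1
        if status not in RETRYABLE or last_try:
            # Non-transient errors (e.g. 403) are not worth retrying.
            raise RuntimeError(
                f"request failed with HTTP {status} "
                f"after {attempt + 1} attempt(s)"
            )
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` argument is injectable so the helper can be exercised without waiting; in real use, `fetch` would wrap whatever HTTP client the script already uses.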
-
Traceback (most recent call last):
File "download_bios.py", line 255, in <module>
assert r.status_code == 200
AssertionError
The error code seems to be 403
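A bare `assert` hides the status code that was actually returned, and assertions are skipped entirely under `python -O`. As a hedged sketch (the helper name `ensure_ok` is hypothetical and is not taken from `download_bios.py`), an explicit check reports the code directly:

```python
def ensure_ok(status_code: int, url: str) -> None:
    """Raise a descriptive error unless the response status is 200.

    Hypothetical helper for illustration; it only inspects the values
    passed in and does not perform any network request itself.
    """
    if status_code != 200:
        raise RuntimeError(f"GET {url} returned HTTP {status_code}")
```

Called as `ensure_ok(r.status_code, r.url)` in place of the assertion, a failure would name the 403 in the error message instead of raising an empty `AssertionError`.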
-
If an error occurs, the index server responds with HTTP status code 200 OK; it should return a 503 or another 5xx error. Seen with:
- a call to a non-existent API endpoint (collection) returns 200 with an empty result
…
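Until the server reports errors through the status code, a client can only guard against this on its own side. The sketch below assumes the response body is newline-delimited JSON and treats a 200 with no records as suspicious; `parse_index_lines` and `EmptyIndexResult` are hypothetical names. An empty result can also be legitimate (no captures for the query), so the caller has to decide how to handle that exception.

```python
import json
from typing import List


class EmptyIndexResult(Exception):
    """Raised when a 200 response carries no records (hypothetical)."""


def parse_index_lines(status: int, body: str) -> List[dict]:
    """Parse a newline-delimited JSON body, flagging empty 200 responses."""
    if status != 200:
        raise RuntimeError(f"index server returned HTTP {status}")
    # One JSON object per non-blank line (an assumption about the format).
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    if not records:
        # The server may have hit an error but still answered 200 OK.
        raise EmptyIndexResult("HTTP 200 with an empty result")
    return records
```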
-
At the moment, the [contributing](https://voice-sprint.mozilla.community/contributing/) page suggests CommonCrawl and OpenSubtitles as good places to find text, while saying that Wikimedia sit…
-
One more question, please.
Using the provided command, how long does each step (e.g., quality filtering, deduplication, quality classifier) take when processing a single index of common…
-
### Version
1
### DataCap Applicant
FileTech
### Project ID
FileTech-02
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / He…
-
I'm trying Common Crawl w/ Hadoop 0.20.205 and I'm getting the following:
Exception in thread "main" java.lang.VerifyError: (class: org/commoncrawl/hadoop/io/JetS3tARCSource, method: configureImpl si…