-
Great work fixing this in 5.3.0, mate.
I'm importing EN now. Do you know of other feeds people use with it?
Have you ever thought about doing something like this with the CommonCrawl?
-
### Version
1
### DataCap Applicant
FileTech
### Project ID
FileTech-02
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / He…
-
Is there a method that supports directly opening a file URL, the way `smart-open` does?
https://pypi.org/project/smart-open/
```
open('s3://commoncrawl/robots.txt')
```
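For reference, a minimal sketch of the URL handling this would involve. It assumes the public `commoncrawl` bucket is also served over HTTPS at `https://data.commoncrawl.org/` (an assumption, not something stated above). `s3_to_https` and `PUBLIC_ENDPOINTS` are hypothetical names, not part of `smart-open` or any other library. The sketch only rewrites the URL and does not open anything.

```python
from urllib.parse import urlsplit

# Assumption: the public `commoncrawl` bucket is mirrored at this HTTPS endpoint.
PUBLIC_ENDPOINTS = {"commoncrawl": "https://data.commoncrawl.org"}


def s3_to_https(url: str) -> str:
    """Map an s3://bucket/key URL to an HTTPS URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {url!r}")
    if parts.netloc not in PUBLIC_ENDPOINTS:
        raise ValueError(f"no known public endpoint for bucket {parts.netloc!r}")
    # parts.path keeps its leading slash, so it can be appended directly.
    return PUBLIC_ENDPOINTS[parts.netloc] + parts.path
```

The returned URL could then be passed to `urllib.request.urlopen`, which needs network access.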
-
Hi, is it possible to get access to the original training code of MarkupLM (CommonCrawl preprocessing, tag masking, etc.)?
-
Dear authors,
First of all, thanks for this very interesting paper and code release.
I am working on building a small dataset with your pipeline (from CommonCrawl using queries) and came across…
-
Hey there,
Do IndicCorpus and the OSCAR corpus come from the same source, i.e. CommonCrawl? I have been thinking of combining OSCAR + IndicCorpus to get a better and bigger corpus (with deduplication).…
-
This loads a WARC file from the local file system:
`val r: RDD[ArchiveRecord] = RecordLoader.loadArchives(path, sparkConf)`
How do I load a WARC file from Amazon S3?
I found this gist. Is this the correct…
-
When I execute:
`python -m cc_net -l fa`
It throws the following exception:
```
File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.…
```
-
```
[MYUSER@MYHOST ~]$ stat .s3cfg
File: `.s3cfg'
Size: 1889 Blocks: 8 IO Block: 4096 regular file
Device: fd02h/64770d Inode: 524485 Links: 1
Access: (0644/-rw…
```
-
I am referring to the process-commoncrawl-with-emr writeup.
I was able to create the clusters successfully, but when I tried to SSH into the master node to run hdfs and hadoop commands, it just did not …