-
```
[MYUSER@MYHOST ~]$ stat .s3cfg
File: `.s3cfg'
Size: 1889 Blocks: 8 IO Block: 4096 regular file
Device: fd02h/64770d Inode: 524485 Links: 1
Access: (0644/-rw…
```
-
Hi, is it possible to get access to the original training code of MarkupLM (CommonCrawl preprocessing, tag masking, etc.)?
-
I was hoping to use this project to look at some newer data. I assume I should just add the names of the indexes to the file 'index.txt'?
-
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Healthcare
### Website
https://commoncrawl.org/
### Social Media Handle
http…
-
Hello and thank you for the great work.
I am trying to understand the quality filter you had, described [here](https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc)
I took you…
-
This error prevents me from executing GAU correctly:
WARN[0000] error reading config: open /home/kali/.gau.toml: no such file or directory
-
# EASM
Create a comment using any of the following tool templates; GitHub Actions will pick it up, trigger the corresponding application, and return the tool's results in a new comm…
-
Hey there,
Do IndicCorpus and the OSCAR corpus come from the same source, i.e. CommonCrawl? I have been thinking of combining OSCAR + IndicCorpus to get a better and bigger corpus (with deduplication).…
-
EDIT: this helped with `Wrong FS`, more tickets incoming ;)
```sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")```
Hey folks, I'm trying to read some Common Crawl data from S3...…
-
Hi there, I encountered a 403 error while trying to download ccnet data using this pipeline.
I'm wondering whether this is because of the network settings on my side, or whether something else is wrong?
Thanks in ad…