-
### Version
1
### DataCap Applicant
Quasar
### Project ID
Quasar-1
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
…
-
### Version
1
### DataCap Applicant
Mongo2Stor
### Project ID
CommonCrawl
### Data Owner Name
Common Crawl
### Data Owner Country/Region
United States
### Data Owner Industry
Not-for-Profit…
-
### Version
1
### DataCap Applicant
DATADAO
### Project ID
DATADAO-01
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Hea…
-
### Version
1
### DataCap Applicant
DataVault Solutions
### Project ID
DataVault Solutions-03
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
…
-
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Healthcare
### Website
https://commoncrawl.org/
### Social Media Handle
http…
-
### Version
1
### DataCap Applicant
DATADAO
### Project ID
DATADAO-02
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Hea…
-
As discussed over the last couple of weeks and reinforced by the upcoming release, we want to replace the following subsets:
- Dolma CC
- C4
- RefinedWeb
With their equivalent in token count, but s…
-
### GPT-3 data mix
* Datasets are not sampled in proportion to their size
* Datasets we view as higher-quality are sampled more frequently
* WebText2, Books1, Wikipedia datasets are sampl…
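
A minimal sketch of the idea in the bullets above: sources are drawn according to hand-picked weights rather than their share of total tokens. The dataset names, sizes, and weights here are hypothetical placeholders chosen for illustration, not the actual GPT-3 figures, and the helper functions are not taken from any published implementation.

```python
import random

# Hypothetical corpus sizes (in tokens) and hand-picked sampling weights.
# Illustrative values only; they are NOT the real GPT-3 mix.
SIZES = {"large_web_crawl": 1_000_000, "curated_web": 50_000, "encyclopedia": 10_000}
WEIGHTS = {"large_web_crawl": 0.6, "curated_web": 0.3, "encyclopedia": 0.1}


def size_share(sizes):
    """Fraction of all tokens contributed by each dataset."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


def relative_oversampling(sizes, weights):
    """Sampling weight divided by size share.

    A value above 1 means the dataset is drawn more often than its
    size alone would imply; below 1 means it is drawn less often.
    """
    shares = size_share(sizes)
    return {name: weights[name] / shares[name] for name in sizes}


def sample_sources(weights, k, seed=0):
    """Pick the source dataset for each of k training examples."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


if __name__ == "__main__":
    for name, ratio in relative_oversampling(SIZES, WEIGHTS).items():
        print(f"{name}: sampled at {ratio:.2f}x its size share")
    print(sample_sources(WEIGHTS, k=5))
```

With these placeholder numbers, the small "encyclopedia" source is drawn far more often than its token share would suggest, while the large crawl is drawn less often.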
-
### Data Owner Name
Commoncrawl
### What is your role related to the dataset
Data Preparer
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Healthcare
### Web…
-
## Version
2024-06-26T12:36:12.600Z
## DataCap Applicant
@lyjmry
## Data Owner Name
Common Crawl
## Data Owner Country/Region
Not-for-Profit
## Website
https://commoncrawl.org
## Social Media Han…