-
### Version
1
### DataCap Applicant
Mongo2Stor
### Project ID
CommonCrawl
### Data Owner Name
Common Crawl
### Data Owner Country/Region
United States
### Data Owner Industry
Not-for-Profit…
-
### Version
1
### DataCap Applicant
Quasar
### Project ID
Quasar-1
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
…
-
### Version
1
### DataCap Applicant
DATADAO
### Project ID
DATADAO-01
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Hea…
-
### Version
1
### DataCap Applicant
DataVault Solutions
### Project ID
DataVault Solutions-03
### Data Owner Name
Commoncrawl
### Data Owner Country/Region
United States
…
-
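## Title: GlotCC: A Large-Scale CommonCrawl Corpus and Pipeline for Minority Languages
## Link: https://arxiv.org/abs/2410.23825
## Summary:
The need for large text corpora has grown with the arrival of pretrained language models, and in particular with the discovery of scaling laws for these models. Most existing corpora, with respect to languages that have large, dominant communities…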
-
Running the command a few times produces no output from the 3rd or 4th run onward
root@kali:~/Desktop# echo hkt.com | gau --subs --threads 50 --verbose
INFO[0000] fetching hkt.com …
-
### Data Owner Name
Commoncrawl
### What is your role related to the dataset
Data Preparer
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Healthcare
### Web…
-
For my own purposes (see https://github.com/commoncrawl/web-languages) one of my volunteers made a mapping from 333 Wikipedia names to the appropriate 3-letter ISO 639-3 language code. Are you interes…
-
### Version
2024-06-26T12:36:12.600Z
### DataCap Applicant
@lyjmry
### Data Owner Name
Common Crawl
### Data Owner Country/Region
Not-for-Profit
### Website
https://commoncrawl.org
…
-
### Data Owner Name
Common Crawl
### What is your role related to the dataset
Data Preparer
### Data Owner Country/Region
United States
### Data Owner Industry
Not-for-Profit
### Website
http…