-
### Data Owner Name
Common Crawl
### Data Owner Country/Region
United States
### Data Owner Industry
IT & Technology Services
### Website
https://commoncrawl.org/
### Social Media Handle
http…
-
HF, IBM, ???
Software Heritage
Common Crawl involved?
LAION? (Ontocord involved...)
-
https://commoncrawl.org/
> We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
I'm not sure how much data it is, but certainly a few TB.
ghost updated
6 years ago
-
### Version
2024-06-26T12:36:12.600Z
### DataCap Applicant
@lyjmry
### Data Owner Name
Common Crawl
### Data Owner Country/Region
Not-for-Profit
### Website
https://commoncrawl.org
…
-
http://data.statmt.org/cc-100/
I will make sure this is reflected in #187.
lovit updated
3 years ago
-
Hi,
Could you suggest where I can get the "arpa" file for the top 400,000 most frequent words of file en.00 from the "common crawl repository", which was used to generate the "trie" file for the English LM?
-
Hi @Marlin-Na,
while searching for examples of how Common Crawl data is used, I stumbled upon this nice project and just looked at the following comments:
https://github.com/Marlin-Na/CommonCrawlDL/b…
-
Hello team,
I'm trying to download all the audio and text data associated with the `eng-frA` split of the Seamless data. My issue is with the text data. When I run the `wet_lines` script, after getti…
-
Hi,
Thank you for releasing the code for data extraction. I am extracting the data with your scripts and noticed some errors in the log file. Most of them are Common Crawl error code 502/503 …
-
Dear Mr. Sebastian Nagel @sebastian-nagel,
I am a member of the Fordham University S & T team. Could you help me get plain-text content from Common Crawl? I have collected some useful URLs by …