-
Hi,
I have implemented a pipeline to process the Common Crawl (CC) data, similar to the FineWeb example in the example folder. The main issue I'm encountering is that, when reading files from CC, t…
-
### Data Owner Name
Common Crawl
### What is your role related to the dataset
Data Preparer
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / Healt…
-
From my amass.log file:
```
"20:55:34.041797 Sublist3rAPI: https://api.sublist3r.com/search.php?domain=somedomain.com: Get "https://api.sublist3r.com/search.php?domain=somedomain.com": dial tcp: l…
```
-
### Data Owner Name
Common Crawl
### Data Owner Country/Region
United States
### Data Owner Industry
IT & Technology Services
### Website
https://commoncrawl.org/
### Social Media Handle
http…
-
CommonCrawl has the [WET files](http://commoncrawl.org/the-data/get-started/), which are WARC files where the HTML responses have been converted to plain text (and non-HTML pages have been removed).
Is it p…
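For reference, a WET record is a WARC record of type `conversion` whose payload is the extracted plain text. Below is a minimal sketch of reading such records with only the Python standard library. It assumes an uncompressed, well-formed stream (real WET files are gzip-compressed, so they would need to be opened with `gzip.open` first). The `iter_wet_records` helper and the sample record are made up for illustration; a real pipeline would more likely use a dedicated WARC library.

```python
import io

def iter_wet_records(stream):
    """Yield (headers, text) for each 'conversion' record in a WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given in bytes by Content-Length.
        body = stream.read(int(headers.get("Content-Length", 0)))
        if headers.get("WARC-Type") == "conversion":
            yield headers, body.decode("utf-8", errors="replace")

# Hypothetical sample record, built by hand for illustration only.
payload = "Hello, plain text.".encode("utf-8")
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

records = list(iter_wet_records(io.BytesIO(SAMPLE)))
```

Records of other types (such as the `warcinfo` record at the start of a file) are read and skipped, so only the converted text is yielded.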
-
@johnmyleswhite, @tanmaykm and I have been discussing doing a blog post on indexing, as a way to show Julia's capabilities for working with large datasets in parallel. This started with [HW2 in our MI…
-
340 WARC files of the news crawl data set, starting from 2020-09-12 until 2020-10-04 have been captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mai…
-
http://commoncrawl.org/ h/t @redblobgames for this idea
could also possibly use links in this data as a search ranking score component
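One simple way to turn links into a ranking signal is to count how many distinct hosts link to each host. A minimal sketch, assuming (source URL, target URL) pairs have already been extracted from the crawl; the `inbound_link_scores` helper and the sample pairs are hypothetical:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def inbound_link_scores(links):
    """Map each target host to the number of distinct other hosts linking to it."""
    sources = defaultdict(set)
    for src_url, dst_url in links:
        src, dst = urlsplit(src_url).hostname, urlsplit(dst_url).hostname
        if src and dst and src != dst:  # ignore self-links and malformed URLs
            sources[dst].add(src)
    return {host: len(srcs) for host, srcs in sources.items()}

# Made-up link pairs for illustration.
links = [
    ("http://a.example/page1", "http://b.example/"),
    ("http://a.example/page2", "http://b.example/about"),
    ("http://c.example/", "http://b.example/"),
    ("http://b.example/", "http://b.example/contact"),
    ("http://b.example/", "http://a.example/"),
]
scores = inbound_link_scores(links)
```

Counting distinct linking hosts rather than raw links keeps one site with many pages from inflating a score on its own; such a count would be only one component alongside text relevance.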
-
## Describe the bug
Edit: three-week review and cleanup.
At the bottom of this report is my config.yaml for reference.
Following along the documentation with regard to fallbacks via a "Sequen…
-
Download front pages of several million websites with curl.
Record all metadata (headers, redirects, TLS version, cipher, and so on) as well as the data (HTTP body).
Create a dataset from it. The dataset…
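A rough sketch of how one such download could be driven from Python, assuming `curl` is on the `PATH`. The helper names, the chosen `--write-out` fields, and the output paths are illustrative assumptions, not part of the original proposal. TLS version and cipher are not captured by the fields used here and would need another source, such as curl's verbose output. Building the JSON template by hand also assumes the effective URL contains no characters that need JSON escaping.

```python
import json
import subprocess

# curl --write-out variables to record for each request.
WRITE_OUT_FIELDS = ["url_effective", "http_code", "http_version",
                    "num_redirects", "remote_ip", "time_total"]

def build_curl_command(url, header_path, body_path, timeout=30):
    """Build a curl command that saves headers and body and prints metadata as JSON."""
    # Produces a template like {"http_code": "%{http_code}", ...} that curl fills in.
    write_out = json.dumps({f: "%{" + f + "}" for f in WRITE_OUT_FIELDS})
    return [
        "curl", "--silent", "--location",
        "--max-time", str(timeout),
        "--dump-header", header_path,   # response headers, including redirect hops
        "--output", body_path,          # HTTP body
        "--write-out", write_out,       # metadata printed to stdout
        url,
    ]

def fetch_front_page(url, header_path, body_path):
    """Run curl and return the recorded metadata as a dict of strings."""
    cmd = build_curl_command(url, header_path, body_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return json.loads(result.stdout) if result.stdout else {}
```

Calling `fetch_front_page("https://example.com/", "headers.txt", "body.bin")` would write the headers and body to the two files and return the metadata dictionary. For several million sites, the calls would need to run concurrently with per-host rate limits.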