-
On line 102 of commoncrawl/src/main/java/org/commoncrawl/util/shared/ARCFileReader.java, there's a comment that says the constructor is private (it's actually public), and refers to the "factory metho…
-
http://commoncrawl.org/the-data/
http://commoncrawl.org/the-data/examples/
https://groups.google.com/forum/#!forum/common-crawl
https://nlp.stanford.edu/pubs/cluster-wsdm09.pdf
http://resources.mp…
-
Whenever I do a search on the local cc-index-server I get errors. When I look at the debug logs, it looks like the final authorization is only using the access key ID and the secret, but not the sessi…
-
Hello,
Thank you for all of your great work. I am just trying to download and process the English dumps from CommonCrawl up to 2023, but I have been running into multiple errors.
It seems as if the …
-
Here we combine all the datasets we could collect:
- [OSCAR's CommonCrawl Dataset](https://traces1.inria.fr/oscar/)
- [Arabic BERT Corpus](https://www.kaggle.com/abedkhooli/arabic-bert-corpus)
- [Hi…
-
Love that you pull from Wayback. There are more datasets you could use too. Complementing Wayback with resources such as http://commoncrawl.org may turn up more results.
-
http://commoncrawl.org/2016/10/news-dataset-available/
We should make sure it works with the current [common crawl source](https://github.com/commonsearch/cosr-back/blob/master/cosrlib/sources/common…
-
I am trying to copy a file in text mode, but it is not working. The URL is com.wordpress.alinebessa/2011/06/11/documenting-accerciser-first-impressions/:http
which exists in CommonCrawl. When I check…
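
For anyone reading along, the key above looks like a reversed-host form of an ordinary URL. Here is a minimal sketch of turning it back into a regular URL, assuming the layout is `reversed.host/path:scheme` (that layout is my reading of the one example here, and `key_to_url` is a hypothetical helper, not part of any CommonCrawl tool):

```python
def key_to_url(key):
    """Convert 'com.example.www/path:scheme' into 'scheme://www.example.com/path'.

    Assumes the scheme follows the last ':' and the host labels are reversed.
    """
    # Split off the scheme, which trails the final colon.
    rest, _, scheme = key.rpartition(":")
    # Everything before the first '/' is the reversed host; the remainder is the path.
    host_reversed, slash, path = rest.partition("/")
    host = ".".join(reversed(host_reversed.split(".")))
    return f"{scheme}://{host}{slash}{path}"


if __name__ == "__main__":
    print(key_to_url(
        "com.wordpress.alinebessa/2011/06/11/"
        "documenting-accerciser-first-impressions/:http"
    ))
```

This only covers the simple case shown; keys that carry a port or a query string would need extra handling.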
-
I'm not sure whether Python performs a DNS lookup for every `urlopen` call. I noticed that `data.commoncrawl.org` resolves to multiple IPs, so we could spread the load over multiple CloudFront servers…
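
To make the idea concrete, here is a rough sketch of listing the addresses the resolver reports and rotating through them. It uses only the standard library; `resolve_ipv4` and `round_robin` are hypothetical helper names, and the sketch does not show how to pin an HTTPS request to a chosen IP (connecting by raw IP would also need the correct hostname for TLS/SNI, which is left out here).

```python
import itertools
import socket


def resolve_ipv4(host, port=443):
    """Return the distinct IPv4 addresses the resolver reports for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) tuple.
    return sorted({info[4][0] for info in infos})


def round_robin(addresses):
    """Cycle through addresses so successive requests can use different servers."""
    return itertools.cycle(addresses)


if __name__ == "__main__":
    ips = resolve_ipv4("data.commoncrawl.org")
    print(ips)
    picker = round_robin(ips)
    for _ in range(4):
        print("next request would go to", next(picker))
```

Whether this actually helps depends on how often the resolver's answer changes and on any caching in the OS, so it would be worth measuring before relying on it.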
-
error reading config: open /home/g0xkayala/.gau.toml: no such file or directory
![image](https://github.com/lc/gau/assets/16838353/36c153ed-26f5-49b3-8fc2-089f6d8f7be9)