-
# EASM
Create a comment using any of the following tool templates; GitHub Actions will pick it up, trigger the corresponding application, and return the tool's results in a new comm…
-
For sunset website
![image](https://cloud.githubusercontent.com/assets/4623063/12933984/2d129e94-cf40-11e5-8695-56283a1a8c91.png)
-
Thank you for the great work!
The repo is great for reproducing the entire data processing pipeline, but a lot of people (including me) seem particularly interested in studying the final quality f…
-
I am using the default script to download c4 with config en.realnewslike, and I am getting an error:
RuntimeError: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='commoncrawl.s3.amazonaws.com', po…
-
### Version
1
### DataCap Applicant
FileTech
### Project ID
FileTech-02
### Data Owner Name
CommonCrawl
### Data Owner Country/Region
United States
### Data Owner Industry
Life Science / He…
-
I had some problems running on AWS EMR with the default mrjob.conf. In case anyone else is running into similar issues, I found that I needed to make two minor changes to mrjob.conf: change python2.7 …
-
A few feature requests and/or requests for help using cdxj-indexer!
--> Also, my timing is good based on the reply by @ikreymer in another issue; it seems we're both coming back to our respective projects…
-
What if I want to use my own pretrained fastText model (or even the Common Crawl model instead of the standard wiki one)? E.g., look at what they publish now: https://fasttext.cc/docs/en/crawl-vectors.html.
Current …
-
* Readme: "These 83 languages are detected"
* python (import cld2; help(cld2)): 161 languages (175 language-script combinations), 240 total language-script combinations
* python (import cld2; print(…
-
There are a few details missing from the paper that are required to really understand what data was actually used for training LLAMA.
The paper notes:
> We preprocess five CommonCrawl dumps, ran…