-
Is this project deprecated? I see there have been no commits since 2013, and a new index scheme appears to have been available since 2015: http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
…
-
When trying to build the project with `cargo build --release`, I'm getting this error.
![Screenshot 2024-08-04 122401](https://github.com/user-attachments/assets/234d532b-17b9-4c9b-9292-a19fd5975de4)
…
-
What if I want to use my own pretrained fastText model (or even the Common Crawl model instead of the standard wiki one)? E.g., look at what they publish now: https://fasttext.cc/docs/en/crawl-vectors.html.
Current …
-
Hi,
did you sample each dataset (Wikipedia, Common Crawl, Subtitles, etc.) equally during German-BERT training?
OpenAI uses unequal sampling, which may lead to better results, as stated in the G…
-
**Describe the bug**
* The GitHub API is very slow (see https://github.com/projectdiscovery/subfinder/discussions/1393)
* Hunter's API gives me an error with the -v option:
```
[WRN] Could not run sou…
```
-
Hi there, I encountered a 403 error while trying to download ccnet data using this pipeline.
I'm wondering whether this is because of the network settings on my side or whether something else is wrong.
Thanks in ad…
-
There are a few details missing from the paper that are required to really understand what data was actually used for training LLAMA.
The paper notes:
> We preprocess five CommonCrawl dumps, ran…
-
When processing CommonCrawl, I frequently get SlowDown Errors: `{'Error': {'Code': 'SlowDown', 'Message': 'Please reduce your request rate.'}`. Is this common? Are there any recommended strategies for…
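For reference, a common general-purpose response to throttling errors like this is to retry with exponential backoff and jitter. The sketch below is only an illustration under assumptions: `SlowDownError` and `retry_with_backoff` are hypothetical names, not part of any library, and the delay values are arbitrary. In real code you would catch whatever exception your client raises and check that its error code is `SlowDown`.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for a throttling error such as 'SlowDown'."""


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call fn(), retrying on SlowDownError with exponential backoff and jitter.

    All parameter names and defaults are assumptions chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the upper bound on each attempt, capped at max_delay,
            # then wait a random time below it so parallel workers spread out.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Wrapping each request in `retry_with_backoff(lambda: ...)` and lowering the number of parallel workers are two knobs that can be adjusted independently.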
-
When I run `python train.py --saveto commoncraw_pretrained --dataset commoncrawl --cutoff 15`, I get the following error:
Traceback (most recent call last):
File "train.py", line 341, in
…
-
- uid: unsupervised_cross_lingual_representation_learning_at_scale
- type: processed
- description:
- name: Unsupervised Cross-lingual Representation Learning at Scale
- description: This pap…