-
Hi there 👋
The tutorial `tutorials/pretrain_redpajama.md` says that you can download the full-size and sample-size RedPajama datasets with the help of `git lfs`.
At least as of right now, it's pos…
-
When I install the `anas-awadalla/mpt-1b-redpajama-200b` language encoder from HuggingFace, I get the following warning message:
```
A new version of the following files was downloaded from https:…
```
-
When running `download.py` in the current 'book' file, an error occurs:
It seems like this is because this dataset is defunct:
-
Hi, thanks for sharing this code base.
After running `bash scripts/run_pile.sh`, I obtain the following results:
The generated domain reweights have slight differences from the r…
-
### Describe the bug
Loading some JSONL files from the RedPajama-Data-1T GitHub source [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) fails due to…
-
Hello! I would like to know if there are any unseen errors or limitations when prompting a model on mobile compared to a PC/laptop.
Specifically, we are testing a RAG system where we provide the mo…
-
Hi, from your blog post it seems that exact dedup was performed on all dumps of RedPajama-V2.
My question is: did you perform dedup for each dump individually, or was it done across different d…
-
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T/discussions/25
-
I have an issue. I used the two datasets you provided: [book, github]
The `mds_sample_redpajama` is this:
![image](https://github.com/princeton-nlp/LLM-Shearing/assets/152595968/e557fe72-543c-445a-b26e…
-
### Describe the bug
**emotions = load_dataset('emotion')**
_UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte_
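The byte `0x8b` at position 1 is consistent with the gzip magic number (`1f 8b`), so one possible cause is that a gzip-compressed file is being decoded as plain UTF-8 text. This is only a guess from the error message. A minimal sketch for checking a file on disk follows; the helper names `looks_gzipped` and `read_text` are hypothetical and not part of any library:

```python
import gzip
from pathlib import Path

# First two bytes of every gzip stream (see RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def read_text(path):
    """Read a file as UTF-8 text, decompressing first if it looks gzipped."""
    if looks_gzipped(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return f.read()
    return Path(path).read_text(encoding="utf-8")
```

If `looks_gzipped` returns `True` for the file that triggers the error, the file needs to be decompressed before it is parsed as text.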
### Steps to reproduce the bug
load_datas…