-
The dummy gateway currently lists a number of subscription features as supported but does not indicate support for `tokenization`.
As tokenization is more flexible than the subscription features, it…
-
scEmbed is an excellent tool that provides a dimensionality-reduction encoding for scATAC-seq data. When I tried to use it to map my data, I found that it took an extremely long time to run model.enco…
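For context, this is roughly how I am calling it, in chunks rather than one huge call; the `model.encode()` signature and the AnnData slicing here are assumptions on my side and may not match scEmbed's actual API:

```python
import numpy as np

# Hypothetical sketch, not scEmbed's documented interface: encode the AnnData
# object in slices so each model.encode() call stays small.
def encode_in_chunks(model, adata, chunk_size=1000):
    parts = []
    for start in range(0, adata.n_obs, chunk_size):
        chunk = adata[start:start + chunk_size]        # AnnData view, no copy
        parts.append(np.asarray(model.encode(chunk)))  # assumed encode() signature
    return np.concatenate(parts, axis=0)
```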
-
Hi! I am currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues where the tokenize_shuffle.py script seems to be attempting to tokenize the entire pro…
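In case it helps, this is the workaround I have been experimenting with locally. I am assuming the script runs on Ray, since "object store" is Ray terminology, and the memory cap below is just an example value:

```python
import ray

# Hypothetical workaround sketch (assuming tokenize_shuffle.py uses Ray):
# cap the object store so Ray spills to disk instead of exhausting RAM.
ray.init(
    object_store_memory=64 * 1024 ** 3,  # e.g. 64 GB; tune to the machine
)
```

If there is an intended way to shard the processed data before tokenizing instead, I would prefer that.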
-
I need to index documents in multiple languages, such as English, German, Russian, Japanese, Korean, and Chinese. May I ask if these languages are currently supported? Does the system support n-gram t…
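To be concrete about what I mean by n-gram tokenization, here is a toy illustration (not this project's code): character n-grams are one common way to index languages such as Japanese or Chinese that have no whitespace word boundaries.

```python
# Illustration only: split text into overlapping character bigrams.
def char_ngrams(text: str, n: int = 2):
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("東京都渋谷区"))  # ['東京', '京都', '都渋', '渋谷', '谷区']
print(char_ngrams("tokenizer"))    # works for alphabetic text too
```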
-
I tried finetuning my model after stage 1. Apparently, there are tokenization mismatches and the loss is 0.
Do you have any idea what the problem might be?
Thanks!
sh finetune_full.sh
```
WARNIN…
-
- [x] Push Branch with starter code
- [x] #54
- [ ] Add dataloading support in the pytorch_dataset class
- [ ] Add modeling support
-
Thank you for sharing this great codebase. I have been trying to pretrain and fine-tune with LLaMA 3.1. While the pretraining works fine, I noticed that the following warnings occur during the fine-…
-
I made some changes to the model (3D convs) and trained the small one with 128 tokens on 128p 16-frame videos pre-compressed with CogVideoX's VAE and MSE loss.
It turned out better than I expected, consi…
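In case it is useful to others, the pre-compression step looked roughly like this. This is a sketch of my own setup, not part of this repo; the checkpoint name and tensor shapes are what I used, via diffusers' `AutoencoderKLCogVideoX`:

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Rough sketch of my pre-compression step: encode 16-frame 128x128 clips into
# VAE latents once up front, then train on the cached latents.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()

@torch.no_grad()
def compress(video):  # video: (batch, 3, 16, 128, 128), values in [-1, 1]
    latents = vae.encode(video.to("cuda", torch.float16)).latent_dist.sample()
    return latents * vae.config.scaling_factor  # scale as the VAE config expects
```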
-
### Describe the issue
Issue: after pretraining Phi-3 with/without VIP, I am getting a tokenization mismatch.
Command:
```
bash scripts/finetune_vip_llava_phi3_stage2.sh
```
Log:
```
WARN…
-
The code from https://huggingface.co/Salesforce/codegen25-7b-multi_P#causal-sampling-code-autocompletion and https://github.com/salesforce/CodeGen/tree/main/codegen25#sampling does not work currently.…
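For completeness, this is the snippet I am running, adapted from the linked pages; the repo id is copied from the URL above, and I may be missing a step:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Adapted from the linked sampling instructions; trust_remote_code is needed
# because the CodeGen2.5 tokenizer ships custom (tiktoken-based) code.
tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/codegen25-7b-multi_P", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-multi_P")

inputs = tokenizer("def hello_world():", return_tensors="pt")
sample = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```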