-
**Describe the bug**
Hello, I ran into a bug with `CacheDataset` when following the training procedure provided by research-contributions/DiNTS/train_multi-gpu.py. I used the MSD Task03 Liver dataset, and when using …
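For context, a minimal sketch of the per-rank `CacheDataset` setup that multi-GPU scripts of this kind typically use; the file list and transform chain below are illustrative placeholders, not the DiNTS originals:
```python
# Minimal sketch (not the DiNTS script itself): each DDP rank caches only
# its own shard of the dataset. The file list and transforms are placeholders.
import torch.distributed as dist
from monai.data import CacheDataset, DataLoader, partition_dataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

dist.init_process_group(backend="nccl")
files = [{"image": f"img_{i}.nii.gz", "label": f"lbl_{i}.nii.gz"} for i in range(100)]

# Give each rank a disjoint partition so caching is not duplicated across GPUs.
shard = partition_dataset(
    data=files,
    num_partitions=dist.get_world_size(),
    shuffle=True,
    even_divisible=True,
)[dist.get_rank()]

transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])
dataset = CacheDataset(data=shard, transform=transforms, cache_rate=1.0, num_workers=4)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
```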
-
**Describe the bug**
When training a model using deepspeed 0.14.2, I got this error:
```python
Traceback (most recent call last): …
```
-
### Before Asking
- [X] I have read the [README](https://github.com/meituan/YOLOv6/blob/main/README.md) carefully.
- [X] I want to train my custom dataset, and I have read …
-
The function that obtains the device, init_distributed_device(args) (training.main.py, line 129), appears to return only a single-GPU device. The key part of the function is defined as follow…
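For reference, per-process device selection under torchrun-style launchers usually looks like the sketch below: each process binds to the single GPU named by `LOCAL_RANK`, and multi-GPU training comes from launching one such process per device. This is a generic illustration, not open_clip's actual implementation:
```python
# Generic sketch of per-process device selection, assuming a torchrun-style
# launcher that sets RANK / LOCAL_RANK / WORLD_SIZE. Not open_clip's code.
import os
import torch
import torch.distributed as dist

def init_distributed_device() -> torch.device:
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    dist.init_process_group(backend="nccl")
    # Each process binds to exactly one GPU; parallelism comes from running
    # one such process per visible device, not from one process seeing them all.
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")
```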
-
```bash
nproc_per_node=4
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
--model_id_or_path "AI-ModelScope/llava-v1.6-mistral-7b" \
--template_type "llava-mistral-inst…
```
-
### Bug description
I followed [this](https://curiousily.com/posts/multi-label-text-classification-with-bert-and-pytorch-lightning/) tutorial to build a Lightning model for multi-label text classific…
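For reference, tutorials like the one linked typically end up with a BERT encoder plus a per-label head trained with `BCEWithLogitsLoss`; here is a minimal sketch (the encoder name, label count, and learning rate are illustrative, not taken from the tutorial):
```python
# Minimal multi-label classification sketch in PyTorch Lightning.
# Encoder name, label count, and learning rate are illustrative.
import torch
import pytorch_lightning as pl
from torch import nn
from transformers import AutoModel

class MultiLabelTagger(pl.LightningModule):
    def __init__(self, n_labels: int = 6, lr: float = 2e-5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-cased")
        self.head = nn.Linear(self.encoder.config.hidden_size, n_labels)
        # BCEWithLogitsLoss treats each label as an independent binary task.
        self.criterion = nn.BCEWithLogitsLoss()
        self.lr = lr

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(hidden.last_hidden_state[:, 0])  # [CLS] token

    def training_step(self, batch, batch_idx):
        logits = self(batch["input_ids"], batch["attention_mask"])
        loss = self.criterion(logits, batch["labels"].float())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```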
-
How can we synchronize files that are written during multi-node training?
* At the end of training, each node reads the file in question and turns it into a byte tensor (see the sketch after this list)
* Synchronize the tensor length, com…
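A minimal sketch of that gather, assuming `torch.distributed` is already initialized with the NCCL backend and each rank has a local GPU; the function name and file handling are illustrative, not from the original post:
```python
# Sketch: gather one file's bytes from every rank. Assumes dist is already
# initialized (e.g. via torchrun) with the NCCL backend; names are illustrative.
import torch
import torch.distributed as dist

def gather_file_bytes(path: str) -> list:
    with open(path, "rb") as f:
        data = torch.frombuffer(bytearray(f.read()), dtype=torch.uint8).cuda()

    # 1. Share lengths so every rank can size its receive buffers.
    length = torch.tensor([data.numel()], dtype=torch.long, device="cuda")
    lengths = [torch.zeros_like(length) for _ in range(dist.get_world_size())]
    dist.all_gather(lengths, length)

    # 2. Pad to the max length: all_gather requires equal tensor shapes.
    max_len = max(int(l.item()) for l in lengths)
    padded = torch.zeros(max_len, dtype=torch.uint8, device="cuda")
    padded[: data.numel()] = data

    # 3. Gather the padded tensors, then trim each back to its true length.
    buffers = [torch.zeros_like(padded) for _ in range(dist.get_world_size())]
    dist.all_gather(buffers, padded)
    return [buf[: int(l.item())].cpu().numpy().tobytes()
            for buf, l in zip(buffers, lengths)]
```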
-
### System Info
- `transformers` version: 4.41.2
- Platform: Linux-5.15.0-1044-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.0
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.2…
-
TL;DR: This is not really an issue report; it is more a request for feedback on the changes I made to make `accelerate` and `deepspeed` work together.
So, I have this training script: https:/…
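For context, the usual way to pair the two libraries is through `accelerate`'s DeepSpeed plugin, with the script started via `accelerate launch`; a minimal sketch follows (the model, optimizer, and data below are placeholders, not the script above):
```python
# Minimal sketch of driving DeepSpeed through accelerate's plugin API.
# The model, optimizer, and data below are placeholders.
import torch
from accelerate import Accelerator, DeepSpeedPlugin

plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=plugin)

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))),
    batch_size=8,
)

# prepare() wraps the model and optimizer in the DeepSpeed engine.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # routed through the DeepSpeed engine's backward
    optimizer.step()
```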
-
```
sh examples/pretrain_t5.sh
setting number of micro-batches to constant 1
> building BertWordPieceLowerCase tokenizer ...
> padded vocab (size: 21230) with 18 dummy tokens (new size: 21248)
> in…
```