-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC …
```
-
Is there any way to save intermediate checkpoints during training?
Sometimes my training fails partway through due to external reasons, so it would be helpful to save a checkpoint every N steps so that I can continue from the last one.
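A common pattern is to checkpoint periodically from inside the training loop. Here is a minimal plain-PyTorch sketch; the training loop, `model`, `optimizer`, and the save interval are all placeholders, not anything from a specific framework:

```python
import torch

SAVE_EVERY = 500  # illustrative interval; tune to your failure tolerance


def save_checkpoint(model, optimizer, step, path):
    # Persist everything needed to resume: weights, optimizer state, and step.
    torch.save({
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)


def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["step"]  # resume the loop from this step


# Inside the training loop:
# for step, batch in enumerate(dataloader, start=start_step):
#     ...
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(model, optimizer, step, f"ckpt_step{step}.pt")
```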
-
### Your current environment
Environment:
torch 2.3.0
vllm 0.5.0.post1
transformers 4.41.2
Main error behavior:
A smaller MoE model, '/data/models/qwen/qwen1.5-2.7Bmoe', does not hit the problem; larger ones fail with the error shown at the bottom.
Code:
```python
from vllm.engine.arg_ut…
```
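For reproduction purposes, a minimal vLLM 0.5.x offline-inference sketch using the smaller model path from the report; the prompt and sampling settings are made up:

```python
from vllm import LLM, SamplingParams

# Model path taken from the report; trust_remote_code is commonly needed
# for Qwen checkpoints.
llm = LLM(model="/data/models/qwen/qwen1.5-2.7Bmoe", trust_remote_code=True)

outputs = llm.generate(["Hello"], SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)
```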
-
I am trying to run a training script using DeepSpeed on eight 32 GB V100 GPUs.
For debugging, I enabled the following flags:
```
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,GRAPH
NCCL_TOPO_DUMP_FILE=top…
```
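These NCCL variables are read when the communicator is first created, so they must be in the environment before `torch.distributed` (or DeepSpeed) initializes the process group. One way to guarantee that from Python; the dump path here is hypothetical, not the one from the report:

```python
import os

# Set before any torch.distributed / deepspeed initialization,
# otherwise NCCL will not pick these up.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
# NCCL_TOPO_DUMP_FILE writes the detected topology to this path
# (hypothetical path for illustration).
os.environ.setdefault("NCCL_TOPO_DUMP_FILE", "/tmp/nccl_topo.xml")
```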
-
### System Info
- `transformers` version: 4.43.0.dev0
- Platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.4
- Safetensors versio…
-
I am trying to run the finetuning script on eight 32 GB V100 GPUs. I am launching with torchrun and using DeepSpeed with both parameter and optimizer offload, plus a few minor modifications:
```
to…
```
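For context, parameter and optimizer offload are controlled by the `zero_optimization` section of the DeepSpeed config. A minimal sketch of that shape as a Python dict; all values are illustrative, not the reporter's actual settings:

```python
# Illustrative ZeRO-3 config with both parameter and optimizer CPU offload.
# Pass it as the `config` argument to deepspeed.initialize(), or save it as
# the JSON file referenced by the launcher.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder batch size
    "fp16": {"enabled": True},             # V100s have no bf16 support
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```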
-
```
nproc_per_node=4
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
  --model_id_or_path "AI-ModelScope/llava-v1.6-mistral-7b" \
  --template_type "llava-mistral-inst…
```
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
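For reference, the minimal shape of an FSDP-wrapped setup under the current API; this assumes a torchrun-style single-node launch, and the model is a stand-in:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torchrun set RANK/WORLD_SIZE etc.; single-node, so global
# rank doubles as the local device index.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a real LLM
model = FSDP(model)  # shards parameters, gradients, and optimizer state
```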
-
Here's the call I'm using to run the script:
```
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file examples/hf-alignment-handbook/configs/accelerate_configs/deepspeed_zero3.yaml --num_proces…
```
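For context, a script launched this way normally uses only the generic `accelerate` API, with the ZeRO-3 details supplied entirely by the YAML file. A minimal sketch with a stand-in model and optimizer:

```python
import torch
from accelerate import Accelerator

# Picks up the DeepSpeed plugin from the launch config automatically.
accelerator = Accelerator()

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)

# In the training loop, use accelerator.backward(loss) instead of
# loss.backward() so DeepSpeed's engine handles the backward pass.
```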
-
1×8 H100 DGX box
Torch version: 2.1.1
CUDA version: 12.1
vLLM: 0.2.3
Inference works just fine with tensor parallel size 1, but when using **tp > 1** I get the error below:
WARNING 12-0…
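For reference, tensor parallelism in the vLLM offline API is requested through `tensor_parallel_size`; in the 0.2.x series, tp > 1 launches one worker per GPU through Ray, which is where multi-GPU runs typically diverge from the tp = 1 path. A minimal sketch with a placeholder model name:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs; the model
# name here is a placeholder, not the reporter's model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```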