-
environment:
CUDA 1.17
tensorflow 2.14
code:
https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_keras_main.py
command:
python3 /LLM/models/official/recommendation/…
-
### Your current environment
```sh
python env.py
Collecting environment information...
PyTorch version: 2.6.0.dev20241011+rocm6.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM us…
-
Is there any way to save intermediate checkpoints during training?
Sometimes my training fails partway through for external reasons, so it would be helpful to save every N steps so I can contin…
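The question does not name a framework, so the following is only a generic sketch of the "save every N steps and resume" pattern, not any particular library's API. The names `save_checkpoint`, `load_checkpoint`, `train`, and `fail_at` are hypothetical, and the `state` dict stands in for whatever model and optimizer state a real trainer would serialize.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Atomically write the step counter and training state to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from `path`, or (0, None) if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(total_steps, save_every, path, fail_at=None):
    """Toy loop: resume from the last checkpoint, save every `save_every` steps."""
    step, state = load_checkpoint(path)
    if state is None:
        state = {"weight": 0.0}  # hypothetical stand-in for model parameters
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated external failure")
        state["weight"] += 0.5  # stand-in for one real optimizer step
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state
```

Calling `train(...)` again with the same `path` after a crash picks up from the last saved step, so at most `save_every - 1` steps of work are repeated.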
-
### Description
### **Concept introduction**
Because SPMD has no scheduling overhead, it gives the best performance, but it often makes complex training tasks hard to develop. For exa…
-
### Your current environment
Hey Team,
I was experimenting with the **LLM** class using gptq_marlin on the GPU, and it is incredibly fast. However, when I tried running it on the CPU, it seems that …
-
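### Please ask your question
- Prerequisites: the chatglm2 LoRA fine-tuning code already runs on a single machine with a single GPU; multi-GPU pp (pipeline-parallel) training for llama also runs.
- Scenario: I want to go a step further and try a single machine with multiple GPUs, so I set "pipeline_parallel_degree": 4 in the /chatglm2/lora_argument.json config file and then, following the example on the official site, launched with the command line: `srun --gres=gpu:4 python3 -u …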
-
### Your current environment
docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: RTX8000 * 2
### 🐛 Describe the bug
The model works f…
-
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, plea…
-
Hi,
Because I don't know how to use Slurm, I tried to run train_lanuage_agent.py directly, using the command given in lamorel:
`python -m lamorel_launcher.launch --config-path /home/yanxue/Groun…
-
### Your current environment
The output of `python collect_env.py`
```text
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_e…