-
### 🐛 Describe the bug
I installed colossal-0.3.3 from source and installed the other packages as described in https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/llama2/README.md.
**Launch command:**
` #!/bin/bash
# NCCL IB environment variabl…
-
TF2 ALBERT pretraining crashes intermittently, in roughly 1 out of every 3 runs, when training with the latest Horovod on 8 nodes; the crash happens at around 3000 steps.
Error message:
```
Loss: 6.436, MLM…
-
```
Greetings.
Are there any instructions for upgrading?
I tried it myself, but it didn't work.
I have:
# uname -a
Linux voip.site.ru 2.6.32-042stab094.7 #1 SMP Wed Oct 22 12:43:21 MSK 2014 i686
i686 i386 GNU/Linux
…
-
I succeeded in running a DeepSpeed program and now want to debug it. I am using VSCode, but it doesn't support a non-Python interpreter. I want to know how to debug a deepspeed program with…
-
### Describe the bug
When I run the OSU benchmark compiled with HPC-X, I get warnings:
`[1635835013.823013] [node181:6471 :async] ib_device.c:475 UCX WARN IB Async event on mlx5_0: GID table cha…
-
run:
sudo sh -c "sh contrib/install.sh node && sh contrib/install.sh web"
error:
build_web_perl FAIL
Appending installation info to /usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/perllocal.pod
-
We are using the A100 IB cards for communication. Each IB card has a bandwidth of 7 GB/s, but we only get 1 GB/s according to the statistics from ibdump.
-
OOM error when loading the flan-t5-xxl model for inference. The model loaded without problems when DeepSpeed was not used, with just the standard Hugging Face Transformers code. It used approximately 20+ G…
-
- Command executed
```docker build -t spu:v1 -f release-ci.DockerFile .```
- Error output
```
Sending build context to Docker daemon 3.584 kB
Step 1/16 : FROM centos:centos7
---> eeb6ee3f44bd
Step 2/16 : RUN se…
-
### System Info
```
- `transformers` version: 4.37.0
- Platform: Linux-6.2.0-1017-aws-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.3
- Safetensors version: 0…