-
susie.sun@yz-amd1:~$ docker run -it rocm/deepspeed:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1_DeepSpeed /bin/bash
root@c50e90963e1a:/var/lib/jenkins# deepspeed --num_gpus 1 deploy.py
[2023-12-14 01:52:…
-
Hi,
I'm new to GitHub and to MPI (mpiexec), and I'm trying to run a process on more than one thread, so I used hwthreads. The problem is that hwthread is limited to a single node, …
-
HOSTFILE=""
# I don't quite understand what HOSTFILE means
# ====== Parameters ======
DATA_PATH=""
# Is there an open-source dataset for this data source?
-
I have about 30-40 hostfiles. I ran a script to create files for all of them and then restarted gasmask, which read the files in seemingly random order. So it would be good if there was a sort function or t…
-
### System Info
```shell
deepspeed 0.14.4+hpu.synapse.v1.18.0
optimum-habana 1.14.0
docker image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-ins…
-
Currently the hostfile updater runs every 60 seconds (or some other fixed period), which is not a good design because touching the hostfile clears the local DNS caches. Hence, we would like to make it update on…
-
**Describe the bug**
This issue occurs on a SLURM cluster where worker nodes equipped with multiple GPUs are shared among users. GPUs are given slot number assignments (for example, on a node wit…
-
**Describe the bug**
**Log output**
After configuring the hostfile using pdsh, I ran the command `deepspeed --num_nodes 2 hostfile=hostfile.txt train.py`, but I found that deepspeed logs into the other machine …
-
### Describe the issue
Issue:
We collected a large-scale instruction dataset and want to use multi-node training. When using the following script, training is too slow and no timing information is logged.
…
-
Hello,
I'm hitting problems running MPI jobs that require more than one node using the system install of OpenMPI 4.1.1 in Rocky 8.6. Specifically, the following script runs on a single 2 core `m3…