Yuanhy1997 / SeqDiffuSeq

Text Diffusion Model with Encoder-Decoder Transformers for Sequence-to-Sequence Generation [NAACL 2024]
https://arxiv.org/abs/2212.10325

Can't run any train script #24

xiang-xiang-zhu closed this issue 11 months ago

xiang-xiang-zhu commented 11 months ago

When I tried to run the training script, I got a message that mpi4py was missing, so I installed it:

pip install mpi4py

Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting mpi4py
  Using cached http://mirrors.aliyun.com/pypi/packages/2e/1a/1393e69df9cf7b04143a51776727dd048586781bca82543594ab439e2eb4/mpi4py-3.1.5.tar.gz (2.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-3.1.5-cp38-cp38-linux_x86_64.whl size=6024408 sha256=64ef1c54d03ecb2c862c4e57da02d6dd8d9e33673ad3948afafca08d60edfd64
  Stored in directory: /root/.cache/pip/wheels/9d/2a/7e/c6575a1d595c7d8cce796177f1b9827975c5b48b31e28f25b9
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-3.1.5

Then I re-ran the training script, and there was no output at all.

bash ./train_scripts/iwslt_en_de.sh 0 de en

I waited for a while, but the program still didn't output anything. I don't know what's wrong. My operating system is Ubuntu. Is it possible that it's an MPI problem?

Yuanhy1997 commented 11 months ago

Thanks for your interest in our work. In my experience, the program usually hangs because of communication between the GPUs. Have you checked that your node supports multi-GPU training? As for mpi4py, please try installing it with conda; as far as I can recall, a pip-installed mpi4py is not a complete installation of mpi4py.
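
The two checks suggested above (is mpi4py usable, and are GPUs visible) can be sketched as a small standalone script. This is only an illustrative sketch and not part of the SeqDiffuSeq repository: the function name `check_environment` and the report keys are made up for this example, and it assumes PyTorch is the framework used for training. Note that a badly broken MPI setup may hang or abort at the import itself, which is also a useful signal.

```python
"""Illustrative pre-flight check before launching the training script."""
import importlib


def check_environment():
    """Return a dict describing whether mpi4py and CUDA GPUs look usable."""
    report = {"mpi4py": None, "mpi_library": None, "world_size": None, "gpus": None}

    try:
        # Importing mpi4py.MPI initializes MPI; an incomplete install may fail here.
        MPI = importlib.import_module("mpi4py.MPI")
        report["mpi4py"] = True
        report["mpi_library"] = MPI.Get_library_version()
        report["world_size"] = MPI.COMM_WORLD.Get_size()
    except Exception as exc:  # ImportError or a failure to load the MPI library
        report["mpi4py"] = False
        report["mpi_library"] = f"unavailable: {exc}"

    try:
        # Assumption: training uses PyTorch; count the GPUs it can see.
        torch = importlib.import_module("torch")
        report["gpus"] = torch.cuda.device_count()
    except Exception:
        report["gpus"] = None  # PyTorch not importable in this environment

    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `mpi4py` is reported as unavailable, reinstalling it through conda (for example `conda install -c conda-forge mpi4py`; the channel is an assumption) is the fix suggested in this thread. A `gpus` value of 0 or 1 would point to the multi-GPU question instead.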

xiang-xiang-zhu commented 11 months ago

Installing with conda works! Thank you.