Alpha-VLLM / Lumina-T2X

Lumina-T2X is a unified framework for Text to Any Modality Generation
MIT License

Training script hangs early at the "initializing pipeline" line with GPUs running at 100% when GPU count > 1 #47

Closed megatomik closed 1 month ago

megatomik commented 1 month ago

Basically self-explanatory. I've installed everything and can run inference just fine; I can even almost train on 1 GPU (I eventually OOM, unfortunately), but when I set nproc-per-node to 2 instead of 1 it just gets stuck. I've tried both the default 5B script and the 2B Next one.

PommesPeter commented 1 month ago

Hi @megatomik ,

Could you share more details about how you're running the code? (e.g., environment, server hardware, run command, etc.)

megatomik commented 1 month ago

cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

nvidia-smi

Thu May 30 13:09:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:41:00.0 Off |                  N/A |
|  0%   49C    P8              42W / 310W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:42:00.0 Off |                  N/A |
|  0%   45C    P8              39W / 310W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:81:00.0 Off |                  N/A |
| 31%   45C    P8              23W / 310W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  | 00000000:C1:00.0 Off |                  N/A |
|  0%   45C    P8              30W / 310W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I run torchrun manually like this

torchrun --nproc-per-node=2 lumina_next_t2i/train.py \
    --master_port 18181 \
    --model NextDiT_2B_GQA_patch2 \
    --data_path trainconfig.json \
    --results_dir results/ \
    --micro_batch_size 1 \
    --global_batch_size 2 --lr 1e-4 \
    --data_parallel fsdp \
    --max_steps 300 \
    --ckpt_every 10 --log_every 1 \
    --precision bf16 --grad_precision fp32 --qk_norm \
    --image_size 256 \
    --vae sdxl

which gives

[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] 
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] *****************************************
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] *****************************************
> initializing model parallel with size 1
> initializing ddp with size 2
> initializing pipeline with size 1

The CPU is an AMD EPYC 7542 with 128 GB of RAM. It's a cloud machine running the Docker image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel.

PommesPeter commented 1 month ago

Do you have NVLink on your server? If so, please add NCCL_P2P_LEVEL=NVL in front of torchrun.
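
For example, NVLink connectivity can be checked with nvidia-smi topo -m (NVLink shows up as NV1/NV2/... entries in the matrix), and the fix applied to the command from above would look roughly like this (a sketch; all other arguments unchanged):

# check the inter-GPU topology; NVLink appears as NV1/NV2/... between GPU pairs
nvidia-smi topo -m

# same training command as above, with the NCCL hint prepended
NCCL_P2P_LEVEL=NVL torchrun --nproc-per-node=2 lumina_next_t2i/train.py \
    --master_port 18181 \
    --model NextDiT_2B_GQA_patch2 \
    --data_path trainconfig.json \
    --results_dir results/ \
    --micro_batch_size 1 \
    --global_batch_size 2 --lr 1e-4 \
    --data_parallel fsdp \
    --max_steps 300 \
    --ckpt_every 10 --log_every 1 \
    --precision bf16 --grad_precision fp32 --qk_norm \
    --image_size 256 \
    --vae sdxl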

megatomik commented 1 month ago

Do you have NVLink on your server? If so, please add NCCL_P2P_LEVEL=NVL in front of torchrun.

That fixed it, thanks. I didn't see anything mentioning that in the README, though, so perhaps it would be a good idea to add it?

PommesPeter commented 4 weeks ago

Do you have NVLink on your server? If so, please add NCCL_P2P_LEVEL=NVL in front of torchrun.

That fixed it, thanks. I didn't see anything mentioning that in the README, though, so perhaps it would be a good idea to add it?

OK, we will add this to our README.
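
A note along these lines could work (just a sketch of possible wording, not the actual README text):

# If your GPUs are connected via NVLink, prefix the training command with NCCL_P2P_LEVEL=NVL, e.g.:
NCCL_P2P_LEVEL=NVL torchrun --nproc-per-node=2 lumina_next_t2i/train.py <training args>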