Closed: megatomik closed this issue 1 month ago
Hi @megatomik ,
Could you share more details about how you're running the code? (e.g. environment, server hardware, run command, etc.)
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
nvidia-smi
Thu May 30 13:09:22 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:41:00.0 Off | N/A |
| 0% 49C P8 42W / 310W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:42:00.0 Off | N/A |
| 0% 45C P8 39W / 310W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:81:00.0 Off | N/A |
| 31% 45C P8 23W / 310W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:C1:00.0 Off | N/A |
| 0% 45C P8 30W / 310W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
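A quick way to confirm whether these GPUs are linked by NVLink (which turns out to matter below) is the standard nvidia-smi topology matrix:

```shell
# Print the GPU interconnect topology matrix.
# Entries like NV1/NV2/NV4 mean the two GPUs share an NVLink;
# PHB/PXB/SYS mean the path between them is PCIe-only.
nvidia-smi topo -m
```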
I run torchrun manually like this:
torchrun --nproc-per-node=2 lumina_next_t2i/train.py \
--master_port 18181 \
--model NextDiT_2B_GQA_patch2 \
--data_path trainconfig.json \
--results_dir results/ \
--micro_batch_size 1 \
--global_batch_size 2 --lr 1e-4 \
--data_parallel fsdp \
--max_steps 300 \
--ckpt_every 10 --log_every 1 \
--precision bf16 --grad_precision fp32 --qk_norm \
--image_size 256 \
--vae sdxl
which gives:
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING]
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] *****************************************
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-30 13:15:19,060] torch.distributed.run: [WARNING] *****************************************
> initializing model parallel with size 1
> initializing ddp with size 2
> initializing pipeline with size 1
The CPU is an AMD EPYC 7542 with 128 GB of RAM. It's a cloud machine running the Docker image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel.
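When a multi-GPU launch hangs like this, one way to see where the ranks are stuck is NCCL's own debug logging; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables. A sketch, keeping the rest of the command as above:

```shell
# NCCL_DEBUG=INFO prints transport and ring/tree setup per rank,
# which usually shows whether initialization or a collective is hanging.
# NCCL_DEBUG_SUBSYS limits output to the init and network subsystems.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
torchrun --nproc-per-node=2 lumina_next_t2i/train.py \
    # ... same training arguments as in the command above ...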
Do you have NVLink on your server? If so, please add NCCL_P2P_LEVEL=NVL in front of torchrun.
That fixed it, thanks! I didn't see anything mentioning that in the README, though, so perhaps it would be a good idea to add it?
OK, we will add this line to our README.
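For the record, the full working command would presumably look like this: the same arguments as the original command, with only the NCCL_P2P_LEVEL prefix added.

```shell
# NCCL_P2P_LEVEL=NVL restricts NCCL peer-to-peer GPU transfers to pairs
# connected via NVLink, bypassing the PCIe P2P path that caused the hang.
NCCL_P2P_LEVEL=NVL torchrun --nproc-per-node=2 lumina_next_t2i/train.py \
    --master_port 18181 \
    --model NextDiT_2B_GQA_patch2 \
    --data_path trainconfig.json \
    --results_dir results/ \
    --micro_batch_size 1 \
    --global_batch_size 2 --lr 1e-4 \
    --data_parallel fsdp \
    --max_steps 300 \
    --ckpt_every 10 --log_every 1 \
    --precision bf16 --grad_precision fp32 --qk_norm \
    --image_size 256 \
    --vae sdxl
```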
Basically self-explanatory. I've installed everything and can infer just fine; I can even almost train on 1 GPU (I eventually OOM, unfortunately). But when I set nproc-per-node to 2 instead of 1, it just gets stuck. I've tried both the default 5B script and 2B Next.