Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

MultiGPUMode.DISTRIBUTED_DATA_PARALLEL makes it go crazy #1087

Closed: Tsardoz closed this issue 10 months ago

Tsardoz commented 1 year ago

🐛 Describe the bug

Using an AWS P3.8xlarge or P3.16xlarge, num_gpus = 4 or 8, device = 'cuda', torch 1.13.1+cu117, super_gradients version 3.1.1.

The setup_device line causes it to go through all the training images, emitting the following error for each one (with a different file name each time, of course). I am able to run the same code on my home PC with one GPU. Any idea what is happening or how I can fix it?

"find: cannot delete ‘/home/ubuntu/train/c98f7e07a529736a0e31ccf5ac477340391af4397a35b1abc91072d4c86f31da.jpg’: No such file or directory"

Note that these files do exist...

```python
from super_gradients import setup_device
from super_gradients.training import Trainer
from super_gradients.training import MultiGPUMode
from super_gradients.training import dataloaders
# Dataloader factories used below
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train,
    coco_detection_yolo_format_val,
)

# dataset_params, BATCH, workers, num_gpus, device and CHECKPOINT_DIR
# are defined earlier in the script.
train_data = coco_detection_yolo_format_train(
    dataset_params={'data_dir': dataset_params['data_dir'],
                    'images_dir': dataset_params['train_images_dir'],
                    'labels_dir': dataset_params['train_labels_dir'],
                    'classes': dataset_params['classes']},
    dataloader_params={'batch_size': BATCH, 'num_workers': workers}
)

val_data = coco_detection_yolo_format_val(
    dataset_params={'data_dir': dataset_params['data_dir'],
                    'images_dir': dataset_params['val_images_dir'],
                    'labels_dir': dataset_params['val_labels_dir'],
                    'classes': dataset_params['classes']},
    dataloader_params={'batch_size': BATCH, 'num_workers': workers}
)

test_data = coco_detection_yolo_format_val(
    dataset_params={'data_dir': dataset_params['data_dir'],
                    'images_dir': dataset_params['test_images_dir'],
                    'labels_dir': dataset_params['test_labels_dir'],
                    'classes': dataset_params['classes']},
    dataloader_params={'batch_size': BATCH, 'num_workers': workers}
)

exp_number = 1
if num_gpus <= 1:
    setup_device(device=device, num_gpus=num_gpus)
    trainer = Trainer(experiment_name=f'yolonas_{exp_number}', ckpt_root_dir=CHECKPOINT_DIR)
else:
    # ************ This line causes problems **********
    setup_device(device=device, multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL, num_gpus=num_gpus)
    trainer = Trainer(experiment_name=f'yolonas_falls_{exp_number}', ckpt_root_dir=CHECKPOINT_DIR)
```

Versions

```
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-1025-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 470.182.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.02
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 4 MiB (16 instances)
L3 cache: 45 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] torch==1.13.1
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.14.1
[pip3] triton==2.0.0
[conda] No relevant packages
```

BloodAxe commented 1 year ago

You can follow the super_gradients.train_from_recipe entry point, which supports DDP out of the box: https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/train_from_recipe.py

You can use it with any of these launch options:

1. `python -m super_gradients.train_from_recipe --config-name=...` (useful when you have installed SG as a package)
2. `torchrun ... train_from_recipe.py --config-name=...`
3. `python train_from_recipe.py --config-name=...`
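
For instance, option 1 could look something like the command below. This is only an illustration: the recipe name and the GPU count are placeholders, not values taken from this issue.

```bash
# Illustrative launch of the packaged entry point with Hydra overrides;
# the recipe name and num_gpus value are placeholders, adjust for your setup.
python -m super_gradients.train_from_recipe \
    --config-name=coco2017_yolo_nas_s \
    multi_gpu=DDP num_gpus=4
```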

The startup sequence should be as follows:

init_trainer()
setup_device()  # See example [here](https://github.com/Deci-AI/super-gradients/blob/5e25574ff452a505084d9edfb0a849175c39dac8/src/super_gradients/training/sg_trainer/sg_trainer.py#L229)
actual_training()
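
Put together, a minimal sketch of that sequence could look like the snippet below. It is only an outline, not a verified fix: it reuses the dataloaders and variables (num_gpus, CHECKPOINT_DIR) from the original report, and model / train_params stand for a YOLO-NAS model and a training hyper-parameters dict defined elsewhere.

```python
from super_gradients import init_trainer, setup_device
from super_gradients.training import Trainer, MultiGPUMode

init_trainer()  # initialize the SG environment before anything else
setup_device(   # then request DDP across the local GPUs
    multi_gpu=MultiGPUMode.DISTRIBUTED_DATA_PARALLEL,
    num_gpus=num_gpus,
)

trainer = Trainer(experiment_name="yolonas_ddp_sketch", ckpt_root_dir=CHECKPOINT_DIR)
trainer.train(
    model=model,                   # YOLO-NAS model built elsewhere
    training_params=train_params,  # training hyper-parameters dict, defined elsewhere
    train_loader=train_data,
    valid_loader=val_data,
)
```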

See the docs for more info.

Tsardoz commented 1 year ago

Thanks. I tried using the decorator method first as I had already written the training code, but that did not work. When I looked at train_from_recipe I saw that it needs a special YAML file for my custom dataset, and I do not have time to work out how to do this. I also had trouble with the Ubuntu environment I was using on AWS, so I swapped to my single-GPU system at home (RTX 4090). I was not able to train in a Jupyter notebook, but exactly the same code converted to a .py file was able to train!

I realise I have not provided anywhere near sufficient information for anyone to help, but there are several things going on here. I am happy for now as I am able to train. I am confident the train_from_recipe method proposed by @BloodAxe is the best way forward, and I will try it when I have time. Just letting you know I appreciate your input but do not have time to try this out just yet. I am particularly confused, though, about why I am having so much trouble with notebooks. Thanks for the good work!

harpreetsahota204 commented 1 year ago

Hi @Tsardoz

Thanks for opening an issue for SG. I'm gathering some feedback on SuperGradients and YOLO-NAS.

Would you be down for a quick call to chat about your experience?

If a call doesn't work for you, no worries. I've got a short survey you could fill out: https://bit.ly/sgyn-feedback.

I know you’re super busy, but your input will help us shape the direction of SuperGradients and make it as useful as possible for you.

I appreciate your time and feedback. Let me know what works for you.

Cheers,

Harpreet