Closed — Tsardoz closed this issue 10 months ago
You can follow the super_gradients.train_from_recipe
entry point which supports DDP out of the box:
https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/train_from_recipe.py
You can use any of the launch options with it:
1) python -m super_gradients.train_from_recipe --config-name=...
(useful when you have SG installed as a package)
2) torchrun ... train_from_recipe.py --config-name=...
3) python train_from_recipe.py --config-name=...
The startup sequence should be as follows:
init_trainer()
setup_device() # See example [here](https://github.com/Deci-AI/super-gradients/blob/5e25574ff452a505084d9edfb0a849175c39dac8/src/super_gradients/training/sg_trainer/sg_trainer.py#L229)
actual_training()
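The startup sequence above can be sketched as follows. Note this uses hypothetical no-op stand-ins rather than the real `super_gradients` functions, purely to illustrate the required call order; in real code you would import `init_trainer` and `setup_device` from `super_gradients`, and the `multi_gpu`/`num_gpus` values shown are assumptions, not from this issue.

```python
# Illustration of the startup order only -- these are hypothetical no-op
# stand-ins, not the real super_gradients implementations.
calls = []

def init_trainer():
    # Stand-in for super_gradients.init_trainer(): must run first,
    # before any device or DDP setup.
    calls.append("init_trainer")

def setup_device(multi_gpu="DDP", num_gpus=4):
    # Stand-in for super_gradients.setup_device(); the multi_gpu and
    # num_gpus arguments here are illustrative assumptions.
    calls.append("setup_device")

def actual_training():
    # Stand-in for your own training code: build the Trainer,
    # dataloaders, and model, then call trainer.train(...).
    calls.append("actual_training")

# The order matters: trainer init, then device/DDP setup, then training.
init_trainer()
setup_device(multi_gpu="DDP", num_gpus=4)
actual_training()
```

The point is simply that `setup_device()` must run after `init_trainer()` and before any training code, so that DDP processes are set up before dataloaders and models are built.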
See the docs for more info.
Thanks. I tried using the decorator method first, as I had already written the training code. This did not work. When I looked at train_from_recipe I saw that it needed a special YAML file for my custom dataset, and I do not have time to work out how to do this. I also had trouble with the Ubuntu environment I was using on AWS, so I swapped to my single-GPU system at home (RTX 4090). I was not able to train in a Jupyter notebook, but exactly the same code converted to a .py file trained fine!

I realise I have not provided anywhere near enough information for anyone to help, but there are several things going on here. I am happy for now, as I am able to train. I am confident the train_from_recipe method proposed by @BloodAxe is the best way forward, and when I have time I will try it. Just letting you know I appreciate your input but do not have time to try this out just yet. I am particularly confused about why I am having so much trouble with notebooks, though. Thanks for the good work!
Hi @Tsardoz
Thanks for opening an issue for SG. I'm gathering some feedback on SuperGradients and YOLO-NAS.
Would you be down for a quick call to chat about your experience?
If a call doesn't work for you, no worries. I've got a short survey you could fill out: https://bit.ly/sgyn-feedback.
I know you’re super busy, but your input will help us shape the direction of SuperGradients and make it as useful as possible for you.
I appreciate your time and feedback. Let me know what works for you.
Cheers,
Harpreet
🐛 Describe the bug
Using AWS P3.8xlarge or P3.16xlarge, num_gpus = 4 or 8, device = 'cuda', torch 1.13.1+cu117, super_gradients version 3.1.1.
The setup_device line causes it to go through all the training images with the following error (with separate file names, of course). I am able to run it on my home PC with one GPU. Any idea what is happening or how I can fix it?
"find: cannot delete ‘/home/ubuntu/train/c98f7e07a529736a0e31ccf5ac477340391af4397a35b1abc91072d4c86f31da.jpg’: No such file or directory"
Note these files do exist...
```python
from super_gradients import setup_device
from super_gradients.training import Trainer
from super_gradients.training import MultiGPUMode
from super_gradients.training import dataloaders
```
Versions
```
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-1025-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 470.182.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.02
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 4 MiB (16 instances)
L3 cache: 45 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] torch==1.13.1
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.14.1
[pip3] triton==2.0.0
[conda] No relevant packages
```