Closed: ifrozenwhale closed this issue 12 months ago
Great work, and the code is clearly written! However, when training with the default configuration on a single NVIDIA RTX 3090 GPU, I noticed something strange: GPU utilization becomes low after the first validation epoch and training becomes slow. Below is the system state when training with 3D+2D (CLIP) features on the ScanRefer dataset.
Here is my system info.

Operating system:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
CPU info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 1200.000
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.88
Virtualization: VT-x
L1d cache: 896 KiB
L1i cache: 896 KiB
L2 cache: 7 MiB
L3 cache: 70 MiB
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
Memory usage (KiB):
               total        used        free      shared  buff/cache   available
Mem:       131878612    16311224     4772492    28355896   110794896    86140472
Swap:        2097148     2097148           0
GPU usage:
Fri Nov 10 22:55:32 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02 Driver Version: 470.223.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 37% 71C P2 125W / 350W | 9342MiB / 24265MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1743 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 2272 G /usr/lib/xorg/Xorg 86MiB |
| 0 N/A N/A 2407 G /usr/bin/gnome-shell 22MiB |
| 0 N/A N/A 31394 G /usr/lib/firefox/firefox 14MiB |
| 0 N/A N/A 37472 C python 9099MiB |
| 0 N/A N/A 89826 G ...2gtk-4.0/WebKitWebProcess 11MiB |
+-----------------------------------------------------------------------------+
Have you ever encountered these problems? Looking forward to your reply. Thanks!
Hi @ifrozenwhale,
Thank you for trying our work!
First, I want to point out that this behavior is not normal. As mentioned in Table 14 of our paper, the inference time on the entire ScanRefer val set (one epoch) is ~13 minutes in total.
Here are my training plots for the ScanRefer dataset (on a single NVIDIA RTX A5000, using 3D+2D features):
We tested it on a single RTX 3090 before and observed a similar training/inference speed. Sorry that I cannot provide much help for your situation since we haven't encountered such issues. Let me know if you want me to provide more plots or details.
@eamonn-zh Thanks for your reply, it is helpful! There are a few things I'd like to confirm. Is the setup above the one that renders multiple views and encodes them with CLIP? Could you also provide some information about your conda environment for my reference? If you could share your CPU information as well, that would be even more helpful. Thanks!
@ifrozenwhale Yes, the above plots are from our best setting, which uses 3D + 2D features (rendering 3 views for each object and encoding them with CLIP) and the contrastive loss. My conda environment is exactly as described in the README (Python 3.10 + PyTorch 2.0.1 with CUDA 11.7). We also previously ran experiments with other PyTorch versions (e.g., 1.13) and didn't notice any issues. One suggestion from my side is to find the bottleneck in your running speed (see which part of the code gets slower after some epochs); low GPU utilization usually means there is a bottleneck on the CPU or disk IO, i.e., the GPU is just waiting for data and doing nothing.
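As a rough sketch of such a check (not code from this repo; `train_loader`, `step_fn`, and `max_batches` are placeholder names for your own objects), one can split the wall-clock time of each batch into DataLoader wait versus model/GPU work:

```python
import time
import torch

def profile_loader(train_loader, step_fn, max_batches=100):
    """Split wall-clock time into DataLoader wait vs. model/GPU work.

    `train_loader` and `step_fn` (one training/inference step on a batch)
    are placeholders for your own objects.
    """
    data_time, compute_time = 0.0, 0.0
    end = time.perf_counter()
    for i, batch in enumerate(train_loader):
        data_time += time.perf_counter() - end      # waiting on CPU / disk IO

        start = time.perf_counter()
        step_fn(batch)                              # forward (+ backward) pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()                # flush queued CUDA kernels
        compute_time += time.perf_counter() - start

        end = time.perf_counter()
        if i + 1 >= max_batches:
            break
    print(f"data loading: {data_time:.1f}s | model/GPU: {compute_time:.1f}s")
```

If the data-loading share keeps growing after the first validation epoch, the bottleneck is in the DataLoader (workers, IO, or memory pinning) rather than in the GPU itself.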
@eamonn-zh Thanks for your reply! I finally solved the problem by setting pin_memory to False, and everything seems to work fine now, which really surprised me. When pin_memory is set to True, I see multiple cores at 100% CPU usage, with the time spent in sys rather than user. Tracing those processes shows they are stuck in (virtual) memory allocation/mapping, specifically in munmap. This leads me to think there may be a problem with pin_memory. I found a similar problem reported in a PyTorch issue, but it didn't seem to be resolved very well. In any case, the problem has finally been solved. Thank you for your detailed information and suggestions!
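For anyone hitting the same symptom, the change amounts to turning off pinned memory when constructing the DataLoader. A minimal sketch with a dummy stand-in dataset (the actual dataset object and argument values in this repo will differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real ScanRefer dataset object.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# pin_memory=True keeps batches in page-locked host memory for faster
# host-to-GPU copies, but allocating/unmapping that memory (mmap/munmap)
# can stall DataLoader workers on some systems, which matches the symptom
# described above (cores stuck at 100% in sys time).
train_loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,
    pin_memory=False,   # the workaround that resolved the slowdown here
)
```

Disabling pinning can cost a little host-to-GPU transfer speed, but it avoids the munmap stalls described above.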
@ifrozenwhale Great to see you solved the problem. Btw, how large is your RAM? Do you think you ran out of RAM? Since this code pre-loads all data into RAM, your machine should have at least 32 GB of available RAM to run it on the ScanRefer dataset. If that's not the case, please move the data loading code into the __getitem__ function or reduce the number of dataloader workers, so that you read the data from your disk instead of from RAM.
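In case it's useful, here is a rough sketch of that lazy-loading pattern (the class name, file layout, and load_scene helper are hypothetical, not this repo's actual code):

```python
import os
import torch
from torch.utils.data import Dataset

def load_scene(path):
    """Hypothetical helper that reads one preprocessed scene from disk."""
    return torch.load(path)

class LazyScanReferDataset(Dataset):
    """Reads each sample from disk in __getitem__ instead of pre-loading
    everything into RAM in __init__ (trades memory for disk IO)."""

    def __init__(self, data_dir):
        self.paths = sorted(
            os.path.join(data_dir, f) for f in os.listdir(data_dir)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Loaded on demand; only the file list is kept in RAM.
        return load_scene(self.paths[idx])
```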
@eamonn-zh I have 128 GB of memory in total; the program uses about 35 GB when pin_memory=True and only about 8 GB when pin_memory=False (I don't know why it is so low).