3dlg-hcvc / M3DRef-CLIP

[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects
https://3dlg-hcvc.github.io/multi3drefer/
MIT License

The training speed becomes so slow after a few epochs #9

Closed ifrozenwhale closed 12 months ago

ifrozenwhale commented 1 year ago

Great work, and the code is clearly written! However, when I was training with the default configuration on a single NVIDIA 3090 GPU, I noticed something strange.

  1. When using only 3D features, training and inference are relatively fast at first, but after about 20 hours (the 24th epoch) everything becomes very slow (tens of times slower), and GPU utilization, power draw, etc. drop significantly.
  2. When using 2D+3D features, validating one epoch takes more than 5 hours (I'm not sure if this is normal), and training becomes particularly slow after the first validation epoch (again tens of times slower), which confuses me. (screenshot attached)

Have you ever encountered these problems? I'm very much looking forward to your reply. Thanks!

ifrozenwhale commented 1 year ago

Here is the system state when training with 3D+2D (CLIP) features on the ScanRefer dataset. As the screenshots show, GPU utilization drops after the first validation epoch and training slows down. (screenshots attached)

This is my system info.

Operating system:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

CPU info:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             56
On-line CPU(s) list:                0-55
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:                           1
CPU MHz:                            1200.000
CPU max MHz:                        3500.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           5187.88
Virtualization:                     VT-x
L1d cache:                          896 KiB
L1i cache:                          896 KiB
L2 cache:                           7 MiB
L3 cache:                           70 MiB
NUMA node0 CPU(s):                  0-13,28-41
NUMA node1 CPU(s):                  14-27,42-55
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant
                                    _tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr 
                                    pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_
                                    single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_
                                    a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

Memory usage:

              total        used        free      shared  buff/cache   available
Mem:      131878612    16311224     4772492    28355896   110794896    86140472
Swap:       2097148     2097148           0

GPU usage:

Fri Nov 10 22:55:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 37%   71C    P2   125W / 350W |   9342MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1743      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      2272      G   /usr/lib/xorg/Xorg                 86MiB |
|    0   N/A  N/A      2407      G   /usr/bin/gnome-shell               22MiB |
|    0   N/A  N/A     31394      G   /usr/lib/firefox/firefox           14MiB |
|    0   N/A  N/A     37472      C   python                           9099MiB |
|    0   N/A  N/A     89826      G   ...2gtk-4.0/WebKitWebProcess       11MiB |
+-----------------------------------------------------------------------------+

Thanks!

eamonn-zh commented 1 year ago

Hi @ifrozenwhale,

Thank you for trying our work!

First, I want to point out that this behavior is not normal. As we mentioned in Table 14 in our paper, the inference time on the entire ScanRefer val (one epoch) is ~13 minutes in total.

Here are my training plots for the ScanRefer dataset (on a single NVIDIA RTX A5000, using 3D+2D features): (two training plot screenshots attached)

We tested it on a single RTX 3090 before and observed a similar training/inference speed. Sorry that I cannot provide much help for your situation since we haven't encountered such issues. Let me know if you want me to provide more plots or details.

ifrozenwhale commented 1 year ago

@eamonn-zh Thanks for your reply, it is very helpful! There are a few things I'd like to confirm. Do the plots above come from the setting that renders multiple views and encodes them with CLIP? Could you also share some information about your conda environment for reference? CPU information would be even more helpful. Thanks!

eamonn-zh commented 1 year ago

@ifrozenwhale Yes, the plots above are from our best setting, which uses 3D + 2D features (rendering 3 views for each object and encoding them with CLIP) and the contrastive loss. My conda environment is exactly as described in the README (Python 3.10 + PyTorch 2.0.1 with CUDA 11.7). We previously also ran experiments with other PyTorch versions (e.g., 1.13) and didn't notice any issues. One suggestion from my side: find the bottleneck in your run by checking which part of the code gets slower after some epochs. Low GPU utilization usually means the bottleneck is on the CPU or disk I/O, i.e., the GPU is just waiting for data and doing nothing.
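
A minimal sketch of such a timing check (assuming `model(batch)` matches your model's forward signature), which times the dataloader wait separately from the GPU compute:

```python
import time
import torch

def profile_steps(dataloader, model, max_steps=50):
    # Split each step into "waiting on the dataloader" vs. "GPU compute";
    # if the data side dominates, the bottleneck is CPU or disk I/O.
    data_time, compute_time = 0.0, 0.0
    end = time.perf_counter()
    for step, batch in enumerate(dataloader):
        data_time += time.perf_counter() - end  # time blocked on data loading
        start = time.perf_counter()
        with torch.no_grad():
            model(batch)  # forward only; move the batch to the GPU first if your collate doesn't
        torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
        compute_time += time.perf_counter() - start
        end = time.perf_counter()
        if step + 1 >= max_steps:
            break
    print(f"data wait: {data_time:.1f}s, compute: {compute_time:.1f}s")
```

If the data-wait time grows across epochs while the compute time stays flat, the slowdown is in the dataloader workers rather than the model.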

ifrozenwhale commented 12 months ago

@eamonn-zh Thanks for your reply! I finally solved the problem by setting pin_memory to False, and everything now seems to work fine. It really surprised me. When pin_memory is set to True, I see multiple cores at 100% CPU usage, with the time spent in sys rather than user. Tracing those processes shows they are stuck in (virtual) memory allocation/mapping, specifically in the munmap call. This leads me to think there may be a problem with pin_memory. I found a similar report in the PyTorch issue tracker, but it didn't seem to be fully resolved. In short, the problem is finally solved. Thank you for your detailed information and suggestions!
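
For reference, the flag in question is the standard `pin_memory` argument on PyTorch's `DataLoader`; a minimal sketch with a dummy dataset standing in for the project's real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the repo's actual Dataset class.
dataset = TensorDataset(torch.randn(100, 3))

# pin_memory=True stages batches in page-locked host memory to speed up
# host-to-GPU copies; on some systems the page pinning/unpinning itself
# stalls in mmap/munmap, which matches the sys-time symptom described above.
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=8,
    pin_memory=False,  # the workaround that resolved this issue
)
```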

eamonn-zh commented 12 months ago

@ifrozenwhale Great to see you solved the problem. By the way, how much RAM do you have? Do you think you ran out of RAM? Since this code pre-loads all data into RAM, your machine should have at least 32 GB of available RAM to run it on the ScanRefer dataset. If that's not the case, please move the data loading code into the __getitem__ function or reduce the number of dataloader workers, so that data is read from disk instead of held in RAM.
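
A sketch of that change (a hypothetical `load_scene` helper stands in for whatever per-scene loading the actual code does):

```python
import torch
from torch.utils.data import Dataset

def load_scene(path):
    # Hypothetical stand-in for the repo's actual per-scene loading code.
    return torch.load(path)

class EagerDataset(Dataset):
    """Pre-loads everything into RAM up front (the behavior described above)."""
    def __init__(self, paths):
        self.samples = [load_scene(p) for p in paths]  # whole dataset in memory

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

class LazyDataset(Dataset):
    """Reads each sample from disk on demand, trading RAM for disk I/O."""
    def __init__(self, paths):
        self.paths = list(paths)  # only file paths kept in memory

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return load_scene(self.paths[idx])  # loaded per request, in the worker
```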

ifrozenwhale commented 12 months ago

@eamonn-zh I have 128 GB of RAM in total. The program uses about 35 GB when pin_memory=True, and only about 8 GB when pin_memory=False (I don't know why it's so low).