Xilinx / Vitis-AI-Tutorials

14-caffe-ssd-pascal not converging #57

Open mhanuel26 opened 2 years ago

mhanuel26 commented 2 years ago

I trained this model on VOC, following every step of the tutorial, and after a long run it seems the training did not converge (if that is the right term). After running the score.sh script on snapshot_iter_120000.caffemodel I get this (end of the log):

I0311 19:11:02.286180   515 net.cpp:284] Network initialization done.
I0311 19:11:02.610352   515 net.cpp:823] Ignoring source layer mbox_loss
I0311 19:11:02.610754   515 caffe.cpp:574] Running for 4952 iterations.
I0311 19:20:58.740268   515 caffe.cpp:438]     Test net output #0: detection_eval = 0.00244108
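
For reference, as far as I understand score.sh essentially wraps a caffe test run like the one below; I have not checked the exact flags inside the script, and the test prototxt path here is a placeholder:

caffe test -model test.prototxt -weights snapshots/snapshot_iter_120000.caffemodel -iterations 4952 -gpu 0 2>&1 | tee SSD_score.log   # assumed equivalent of score.sh; actual flags may differ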

I have an RTX 3060; this is the nvidia-smi output:

Fri Mar 11 20:59:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P8    25W / 170W |   1150MiB / 12288MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1122      G   /usr/lib/xorg/Xorg                640MiB |
|    0   N/A  N/A      1465      G   /usr/bin/gnome-shell              139MiB |
|    0   N/A  N/A      3933      G   ...AAAAAAAAA= --shared-files      107MiB |
|    0   N/A  N/A    443952      C   caffe                             191MiB |
|    0   N/A  N/A   2159972      G   ...952486002011016735,131072       65MiB |
+-----------------------------------------------------------------------------+

My training stopped accidentally a little after iteration 50000, so I used the following command to resume:

caffe train -solver solver.prototxt -snapshot /workspace/SSD/workspace/Mobilenetv2-SSD/snapshots/snapshot_iter_50000.solverstate -gpu 0 2>&1 | tee SSD_train_2.log
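
As a sanity check on whether the resumed run is actually learning, I believe the reported loss can be pulled straight out of the training log with something like the command below (the grep pattern is just my guess at the Caffe/SSD log format):

grep "mbox_loss" SSD_train_2.log | tail -n 20   # assumed log format; mbox_loss should be trending downward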

However, when I run the score script on snapshot 20000 the result is even worse:

I0311 21:09:44.394701   591 net.cpp:284] Network initialization done.
I0311 21:09:45.296046   591 net.cpp:823] Ignoring source layer mbox_loss
I0311 21:09:45.296661   591 caffe.cpp:574] Running for 4952 iterations.
I0311 21:21:22.749675   591 caffe.cpp:438]     Test net output #0: detection_eval = 0.000904203

How can I solve this?