chenfengxu714 / SqueezeSegV3


Gets stuck during training initialization #26

Open guragamb opened 3 years ago

guragamb commented 3 years ago

Hi! I was trying to get the repo working, but training gets stuck right before it begins (all the files are in the correct directories, and the Python packages match exactly what you outline in requirements.txt).

Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to /host-machine/semKITTI/lidar-bonnetal/logging/ for further reference.
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /host-machine/semKITTI/Data/SemanticKitti/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content:  tensor([  0.0000,  22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
        887.2239, 963.8915,   5.0051,  63.6247,   6.9002, 203.8796,   7.4802,
         13.6315,   3.7339, 142.1462,  12.6355, 259.3699, 618.9667])
Using SqueezeNet Backbone
Depth of backbone input =  5
Original OS:  16
New OS:  16
Strides:  [2, 2, 2, 2]
Decoder original OS:  16
Decoder new OS:  16
Decoder strides:  [2, 2, 2, 2]
Using CRF!
Total number of parameters:  928889
Total number of parameters requires_grad:  928884
Param encoder  735676
Param decoder  181248
Param head  11540
Param CRF  425
No path to pretrained, using random init.
Training in device:  cuda
Let's use 2 GPUs!
Ignoring class  0  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([0])
[IOU EVAL] INCLUDE:  tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19])

It gets stuck after the last line and doesn't do anything (nothing gets written to the logs either). I know you mentioned that SqueezeSegV3 was developed with reference to the RangeNet++ project, and that repo has a similar issue where training gets stuck (I was able to reproduce the hang there as well): https://github.com/PRBonn/lidar-bonnetal/issues/39. I would really appreciate any thoughts on why this might be happening!
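For anyone debugging the same symptom: when a PyTorch script prints all of its setup output and then hangs before the first iteration, the DataLoader worker processes are a common culprit (too many workers, or too little shared memory inside a container). Below is a minimal, hypothetical sketch for isolating that; the DummyScans dataset and its tensor shapes are stand-ins, not this repo's actual Parser.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


# Hypothetical stand-in for the projected SemanticKITTI scans; swap in the
# real dataset object to reproduce the hang in isolation.
class DummyScans(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Fake range image: 5 channels x 64 x 2048, roughly the shape the
        # log above implies ("Depth of backbone input = 5").
        return torch.randn(5, 64, 2048), torch.zeros(64, 2048, dtype=torch.long)


def time_first_batch(num_workers):
    loader = DataLoader(DummyScans(), batch_size=2, shuffle=True,
                        num_workers=num_workers)
    start = time.time()
    next(iter(loader))  # if worker startup is the problem, it hangs here
    print(f"num_workers={num_workers}: first batch after {time.time() - start:.2f}s")


if __name__ == "__main__":
    time_first_batch(0)  # single-process loading: should always come back
    time_first_batch(4)  # if this stalls, the worker processes are the issue
```

If num_workers=0 returns but higher worker counts stall, lowering the workers value in the training config (or, when running inside Docker, increasing shared memory with --shm-size) is worth trying.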

chenfengxu714 commented 3 years ago

Hi, thanks for reporting this. It seems that some dependencies now have conflicts. I have updated the requirements file and some of the code, and tested it on a new machine; everything seems to work now.
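As a quick sanity check after updating, something like the following prints the installed versions of the usual suspects so they can be compared against the new requirements.txt by hand (the package list here is an example, not the exact contents of that file):

```python
# Print the installed versions of a few likely suspects so they can be
# checked against the updated requirements.txt by hand.
from importlib.metadata import PackageNotFoundError, version

import torch

for pkg in ["torch", "torchvision", "numpy", "PyYAML", "opencv-python"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

print("CUDA available:", torch.cuda.is_available(),
      "| GPUs:", torch.cuda.device_count())
```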

Also, this code needs a lot of GPU memory. I train V321 with a mini-batch of 2 on each GPU (24 GB) and V351 with a mini-batch of 1 on each GPU (24 GB); the full networks are trained on 8 GPUs with syncBN. If you don't have that much GPU memory, I highly recommend using a smaller width setting, e.g. 1024 or 512, and fine-tuning from my pretrained models instead of training from scratch. We find that this approach gets close to training from scratch at a width of 2048.
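To make that recipe concrete, here is a rough, hypothetical sketch of loading a pretrained checkpoint and wrapping the model for multi-GPU training with synchronized BatchNorm in plain PyTorch; the checkpoint filename, its key layout, and the DDP setup are assumptions rather than this repo's actual training code.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def build_finetune_model(model, ckpt_path="pretrained/SSGV3-21.pth"):
    """Load pretrained weights and wrap the model for multi-GPU training
    with synchronized BatchNorm.

    Assumes torch.distributed.init_process_group() has already been
    called, e.g. by launching the script with torchrun.
    """
    # strict=False tolerates keys that are missing from or extra in the
    # checkpoint, so a partial match still loads.
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("state_dict", state)  # assumed checkpoint layout
    model.load_state_dict(state, strict=False)

    # SyncBatchNorm pools batch statistics across GPUs, which matters when
    # each GPU only sees a mini-batch of 1 or 2 scans.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    device = torch.cuda.current_device()
    return DDP(model.to(device), device_ids=[device])
```

Launched with something like torchrun --nproc_per_node=8 on the training script, this mirrors the 8-GPU syncBN setup described above; with fewer GPUs the per-GPU mini-batch or the width has to shrink accordingly. The width reduction itself (2048 to 1024 or 512) is presumably just a change to the projection width in the arch .yaml rather than a code change.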