Stuck during training?

akouri-dd commented 4 years ago

Hello, I am trying to train the model (both from scratch and on a pretrained model) on the SemanticKitti dataset, which I've downloaded. I am running the training on a machine with 2x 1080Ti.

I have let the training sit for about 1 hour, and nothing has happened so far. The tb directory is also empty, so I am not sure if it is actually doing anything.

Interestingly, nvidia-smi shows that the GPUs are idle, but memory has been allocated on them.

INTERFACE:
dataset /home/ddlabs/data/kitti/
arch_cfg /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/arch_cfg.yaml
data_cfg /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/data_cfg.yaml
log /tmp/train_log/
pretrained /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/
----------

Commit hash (training version):  b'4233111'
----------

Opening arch config file /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/arch_cfg.yaml
Opening data config file /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/data_cfg.yaml
model folder exists! Using model from /home/ddlabs/catkin_ws/src/rangenet_lib/darknet53/
Copying files to /tmp/train_log/ for further reference.
Sequences folder exists! Using sequences from /home/ddlabs/data/kitti/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from /home/ddlabs/data/kitti/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content:  tensor([  0.0000,  22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
        887.2239, 963.8915,   5.0051,  63.6247,   6.9002, 203.8796,   7.4802,
         13.6315,   3.7339, 142.1462,  12.6355, 259.3699, 618.9667])
Using DarknetNet53 Backbone
Depth of backbone input =  5
Original OS:  32
New OS:  32
Strides:  [2, 2, 2, 2, 2]
Decoder original OS:  32
Decoder new OS:  32
Decoder strides:  [2, 2, 2, 2, 2]
Total number of parameters:  50377364
Total number of parameters requires_grad:  50377364
Param encoder  40585504
Param decoder  9786080
Param head  5780
Successfully loaded model backbone weights
Successfully loaded model decoder weights
Successfully loaded model head weights
Training in device:  cuda
Let's use 2 GPUs!
Ignoring class  0  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([0])
[IOU EVAL] INCLUDE:  tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19])

tano297 commented 4 years ago

This is strange. The only thing I can think of could be that the parser is not feeding images to the training in the generator, meaning it would get stuck getting items in the first iteration of the training loop. Can you share your parser? Are you using a custom one?
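One way to test this hypothesis is to pull a single batch from the loader under a timeout, outside the training loop. The helper below is a minimal sketch using only the standard library; `first_item_with_timeout` is a hypothetical name, not part of this repository, and it accepts any iterable (for example, the training data loader). If the first batch never arrives, the parser or its worker processes are the likely culprit; if it arrives promptly, the hang is further downstream.

```python
import threading
import time


def first_item_with_timeout(iterable, timeout_s):
    """Fetch the first item of `iterable` in a helper thread.

    Raises TimeoutError if nothing is produced within `timeout_s` seconds.
    """
    result = {}

    def fetch():
        try:
            result["item"] = next(iter(iterable))
        except Exception as exc:  # surface loader errors to the caller
            result["error"] = exc

    worker = threading.Thread(target=fetch, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        raise TimeoutError(f"no batch produced within {timeout_s} s")
    if "error" in result:
        raise result["error"]
    return result["item"]


def slow_loader():
    # Hypothetical stand-in for a loader that stalls on its first item.
    time.sleep(0.5)
    yield "batch"


print(first_item_with_timeout(["batch0", "batch1"], timeout_s=1.0))
try:
    first_item_with_timeout(slow_loader(), timeout_s=0.05)
except TimeoutError as err:
    print("loader looks stuck:", err)
```

With a PyTorch loader, a common follow-up is to repeat the check with the number of worker processes set to 0, so that loading happens in the main process and any exception surfaces directly.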

akouri-dd commented 4 years ago

I am using the default semantic-kitti.yaml parser along with the semantic kitti labels, and then I am using the sequences from the Kitti Odometry dataset.

I have determined that the line it is getting stuck on is this one:

output = model(in_vol, proj_mask)

And I added some print statements to make sure it was seeing the .label and .bin files in the sequences directories, and it is...

Will continue investigating.
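Instead of adding print statements by hand, Python's standard `faulthandler` module can dump the stack of every thread once a call has run longer than a deadline. The following is a sketch under stated assumptions: `run_with_hang_dump` and `slow_forward` are hypothetical names, and the sleep merely stands in for the forward pass. Only Python frames are shown, so a hang inside native or CUDA code would appear as the Python line that called into it.

```python
import faulthandler
import tempfile
import time


def run_with_hang_dump(fn, timeout_s, dump_file):
    """Run fn(); if it is still running after timeout_s seconds,
    write the stack of every thread to dump_file (a real file object)."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=dump_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once the call has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_forward():
    # Hypothetical stand-in for: output = model(in_vol, proj_mask)
    time.sleep(0.5)
    return "done"


with tempfile.TemporaryFile(mode="w+") as dump:
    result = run_with_hang_dump(slow_forward, 0.1, dump)
    dump.seek(0)
    report = dump.read()

print(report)
```

In a real run the dump file could simply be `sys.stderr`, with a timeout of a few minutes wrapped around the first training iteration.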

akouri-dd commented 4 years ago

I tried changing the batch size from 8 down to 2 and it works now... weird.

ghost commented 4 years ago

I had the same problem and solved it by re-running the training. I also found that the GPU memory allocation is not stable.

jiseongHAN commented 4 years ago

I have the same problem, along with this error:

----------
INTERFACE:
dataset dataset/dataset
arch_cfg config/arch/darknet53.yaml
data_cfg config/labels/semantic-kitti.yaml
log runs
pretrained None
----------

Commit hash (training version):  b'4233111'
----------

Opening arch config file config/arch/darknet53.yaml
Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to runs for further reference.
Sequences folder exists! Using sequences from dataset/dataset/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from dataset/dataset/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content:  tensor([  0.0000,  22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
        887.2239, 963.8915,   5.0051,  63.6247,   6.9002, 203.8796,   7.4802,
         13.6315,   3.7339, 142.1462,  12.6355, 259.3699, 618.9667])
Using DarknetNet53 Backbone
Depth of backbone input =  5
Original OS:  32
New OS:  32
Strides:  [2, 2, 2, 2, 2]
Decoder original OS:  32
Decoder new OS:  32
Decoder strides:  [2, 2, 2, 2, 2]
Total number of parameters:  50377364
Total number of parameters requires_grad:  50377364
Param encoder  40585504
Param decoder  9786080
Param head  5780
No path to pretrained, using random init.
Training in device:  cpu
/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Ignoring class  0  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([0])
[IOU EVAL] INCLUDE:  tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19])
Traceback (most recent call last):
  File "/mnt/han/lidar-bonnetal/train/tasks/semantic/train.py", line 115, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 236, in train
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 318, in train_epoch
    loss = criterion(torch.log(output.clamp(min=1e-8)), proj_labels)
  File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 204, in forward
    return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/functional.py", line 1840, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: expected scalar type Long but found Int
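
Note that this run reports "Training in device:  cpu", and the traceback is a dtype error rather than a hang: `nll_loss` requires its target to be a 64-bit integer (Long) tensor, but it received a 32-bit (Int) one. A possible workaround, assuming `proj_labels` reaches the loss as int32 on the CPU path, is to cast it before calling the criterion. The sketch below uses random stand-in tensors, and `to_long_labels` is a hypothetical helper rather than a function from this repository.

```python
import torch
import torch.nn as nn


def to_long_labels(labels: torch.Tensor) -> torch.Tensor:
    """Cast integer class labels to int64, the target dtype NLLLoss expects."""
    return labels if labels.dtype == torch.long else labels.long()


torch.manual_seed(0)

# Hypothetical stand-ins: 2 scans, 20 classes, a tiny 4x8 range image.
output = torch.softmax(torch.randn(2, 20, 4, 8), dim=1)
proj_labels = torch.randint(0, 20, (2, 4, 8), dtype=torch.int32)

criterion = nn.NLLLoss()
# Same call shape as in the traceback, with the target cast to Long first.
loss = criterion(torch.log(output.clamp(min=1e-8)), to_long_labels(proj_labels))
print(proj_labels.dtype, "->", to_long_labels(proj_labels).dtype, "loss:", loss.item())
```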

saad7778 commented 4 years ago

Could you please tell me the minimum system requirements, such as RAM and GPU, to run the training properly? Thanks.

NagarajDesai1 commented 4 years ago

Check the shared memory allocation in your container with the df -h command. Run the container with the --shm-size 32G option. It should work then.
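
The same check can be done from inside Python before training starts. This is a minimal sketch, assuming a Linux container where shared memory is mounted at /dev/shm; `shm_free_gib` is a hypothetical helper name. Docker's default shared memory size is 64 MB, which data loader worker processes passing tensors between each other can exhaust quickly.

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return the free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / 2**30


free = shm_free_gib()
if free is None:
    print("/dev/shm not found on this system")
else:
    print(f"/dev/shm free: {free:.2f} GiB")
```

If the reported value is only a few tens of megabytes, restarting the container with a larger --shm-size, as suggested above, is worth trying before anything else.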

BenjaminYoung29 commented 4 years ago

A year ago I ran the experiment successfully. But now the same container gets stuck during training and cannot be killed, even though I've run the container with --shm-size 32G. Weird...

BenjaminYoung29 commented 4 years ago

I am using the default semantic-kitti.yaml parser along with the semantic kitti labels, and then I am using the sequences from the Kitti Odometry dataset.

I have determined that the line it is getting stuck on is this one:
output = model(in_vol, proj_mask)
And I added some print statements to make sure it was seeing the .label and .bin files in the sequences directories, and it is...

Will continue investigating.

The code is getting stuck on this line in segmentator.py: y, skips = self.backbone(x)

Execution never reaches the code in the backbone folder.

guragamb commented 3 years ago

Is there any update/solution to this issue?

PRBonn / lidar-bonnetal

Stuck during training? #39