This is strange. The only thing I can think of could be that the parser is not feeding images to the training in the generator, meaning it would get stuck getting items in the first iteration of the training loop. Can you share your parser? Are you using a custom one?
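One quick way to check that is to pull a couple of batches straight from the data loader and time them. This is only a sketch, and `train_loader` is a stand-in for whatever DataLoader the parser returns in your setup:

```python
import itertools
import time

# Hypothetical: 'train_loader' is the PyTorch DataLoader produced by the parser.
# If this loop hangs or never prints, the problem is in data loading, not the model.
start = time.time()
for i, batch in enumerate(itertools.islice(train_loader, 3)):
    print(f"batch {i} fetched after {time.time() - start:.1f}s, "
          f"element types: {[type(x) for x in batch]}")
    start = time.time()
print("data pipeline is producing batches")
```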
I am using the default semantic-kitti.yaml parser along with the SemanticKITTI labels, and I am using the sequences from the KITTI Odometry dataset.
I have determined that the line it is getting stuck on is this one:
output = model(in_vol, proj_mask)
I added some print statements to make sure it was seeing the .label and .bin files in the sequences directories, and it is... Will continue investigating.
I tried changing the batch size from 8 down to 2, and it works now... weird.
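For what it's worth, the hang going away at a smaller batch size is consistent with the DataLoader workers running out of shared memory: with num_workers > 0, batches are passed between processes through /dev/shm, so both batch size and worker count scale that usage. A minimal sketch (the dataset variable here is a placeholder, not the repo's actual API):

```python
from torch.utils.data import DataLoader

# Placeholder dataset object; in lidar-bonnetal this would come from the parser.
loader = DataLoader(
    train_dataset,      # hypothetical variable
    batch_size=2,       # smaller batches -> smaller tensors copied through /dev/shm
    num_workers=2,      # fewer workers -> fewer simultaneous shared-memory buffers
    shuffle=True,
    pin_memory=True,
)
```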
I have the same problem, and I solved it by re-running. I also found that the GPU memory allocation is not stable.
I have the same problem, with this error:
----------
INTERFACE:
dataset dataset/dataset
arch_cfg config/arch/darknet53.yaml
data_cfg config/labels/semantic-kitti.yaml
log runs
pretrained None
----------
Commit hash (training version): b'4233111'
----------
Opening arch config file config/arch/darknet53.yaml
Opening data config file config/labels/semantic-kitti.yaml
No pretrained directory found.
Copying files to runs for further reference.
Sequences folder exists! Using sequences from dataset/dataset/sequences
parsing seq 00
parsing seq 01
parsing seq 02
parsing seq 03
parsing seq 04
parsing seq 05
parsing seq 06
parsing seq 07
parsing seq 09
parsing seq 10
Using 19130 scans from sequences [0, 1, 2, 3, 4, 5, 6, 7, 9, 10]
Sequences folder exists! Using sequences from dataset/dataset/sequences
parsing seq 08
Using 4071 scans from sequences [8]
Loss weights from content: tensor([ 0.0000, 22.9317, 857.5627, 715.1100, 315.9618, 356.2452, 747.6170,
887.2239, 963.8915, 5.0051, 63.6247, 6.9002, 203.8796, 7.4802,
13.6315, 3.7339, 142.1462, 12.6355, 259.3699, 618.9667])
Using DarknetNet53 Backbone
Depth of backbone input = 5
Original OS: 32
New OS: 32
Strides: [2, 2, 2, 2, 2]
Decoder original OS: 32
Decoder new OS: 32
Decoder strides: [2, 2, 2, 2, 2]
Total number of parameters: 50377364
Total number of parameters requires_grad: 50377364
Param encoder 40585504
Param decoder 9786080
Param head 5780
No path to pretrained, using random init.
Training in device: cpu
/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Ignoring class 0 in IoU evaluation
[IOU EVAL] IGNORE: tensor([0])
[IOU EVAL] INCLUDE: tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19])
Traceback (most recent call last):
File "/mnt/han/lidar-bonnetal/train/tasks/semantic/train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 236, in train
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 318, in train_epoch
loss = criterion(torch.log(output.clamp(min=1e-8)), proj_labels)
File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 204, in forward
return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
File "/home/shkim/anaconda3/envs/han/lib/python3.6/site-packages/torch/nn/functional.py", line 1840, in nll_loss
ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: expected scalar type Long but found Int
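The RuntimeError at the end of that log is a separate problem from the hang: nn.NLLLoss requires the target tensor to be int64 (torch.long), and here the labels arrive as int32. A minimal, hedged workaround is to cast the labels before the loss call, mirroring the line shown in the traceback:

```python
import torch

# Inside the training loop, mirroring trainer.py's loss call:
# NLLLoss requires int64 targets, so cast the projected labels explicitly.
proj_labels = proj_labels.long()  # int32 -> int64 fixes "expected scalar type Long but found Int"
loss = criterion(torch.log(output.clamp(min=1e-8)), proj_labels)
```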
Can you please tell me the minimum system requirements (RAM and GPU) to train this properly? Thanks.
Check the shared memory allocation in your container with the df -h command. Run the container with the --shm-size 32G option; it should work then.
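If you'd rather check it from inside the training script than with df -h, this sketch uses only the Python standard library:

```python
import shutil

# /dev/shm is the shared-memory mount that DataLoader workers use for IPC.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
# Docker's default --shm-size is only 64 MiB, which is far too small for
# multi-worker loading of full LiDAR scans; --shm-size 32G avoids that limit.
```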
A year ago I ran the experiment successfully, but now the same container gets stuck during training and cannot be killed. I ran the container with --shm-size 32G. Weird...
The code is getting stuck on this line in segmentator.py:
y, skips = self.backbone(x)
It never reaches the code in the backbone folder.
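One way to tell whether the hang is really inside the forward pass (rather than in data loading or multi-GPU setup) is to call the model on a synthetic input, bypassing the parser entirely. This is only a sketch: the 5-channel input matches the "Depth of backbone input = 5" line in the log above, while the 64x2048 resolution is an assumption based on the default SemanticKITTI sensor config.

```python
import torch

# Hypothetical: 'model' is the already-constructed segmentator (possibly wrapped
# in nn.DataParallel). If this forward pass also hangs, the problem is in the
# model/GPU setup (e.g. multi-GPU communication), not in the DataLoader.
in_vol = torch.randn(1, 5, 64, 2048)   # [batch, channels, H, W]
proj_mask = torch.ones(1, 64, 2048)    # valid-pixel mask
if torch.cuda.is_available():
    in_vol, proj_mask = in_vol.cuda(), proj_mask.cuda()
with torch.no_grad():
    output = model(in_vol, proj_mask)
print("forward pass completed:", output.shape)
```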
Is there any update/solution to this issue?
Hello, I am trying to train the model (both from scratch and on a pretrained model) on the SemanticKitti dataset, which I've downloaded. I am running the training on a machine with 2x 1080Ti.
I have let the training sit for about 1 hour, and nothing has happened so far. The tb directory is also empty, so I am not sure if it is actually doing anything. Interestingly, nvidia-smi shows that the GPUs are idle, but memory has been allocated on them.