HuguesTHOMAS / KPConv-PyTorch

Kernel Point Convolution implemented in PyTorch
MIT License

Training Issue on own dataset #42

Closed fhartmann17 closed 4 years ago

fhartmann17 commented 4 years ago

Hi @HuguesTHOMAS,

thanks for making your code open-source. I am currently trying to train KPConv on my own dataset, which is in KITTI format.

But in the last step of each epoch I get stuck in the while loop in SemanticKitti.py, lines 772-774:

765   # Get the indices to generate thanks to potentials
766   used_classes = self.dataset.num_classes - len(self.dataset.ignored_labels)
767   class_n = num_centers // used_classes + 1
768   if class_n < class_potentials.shape[0]:
769       _, class_indices = torch.topk(class_potentials, class_n, largest=False)
770   else:
771       class_indices = torch.zeros((0,), dtype=torch.int32)
772       while class_indices.shape[0] < class_n:
773           new_class_inds = torch.randperm(class_potentials.shape[0])
774           class_indices = torch.cat((class_indices, new_class_inds), dim=0)
775       class_indices = class_indices[:class_n]
776   class_indices = self.dataset.class_frames[i][class_indices]

(I added the [0] in the while condition)

Usually the code shouldn't even enter this branch at that stage, should it? Do you know what my problem might be and how to solve it?

HuguesTHOMAS commented 4 years ago

Hi @fhartmann17,

Thanks for your interest in my code.

First, there is indeed a typo there: the [0] was missing, so you were right to add it.

Then, yes, the code usually doesn't need to enter this loop. Let me explain why. Say we want to select 1000 lidar scans for one epoch and we have 10 classes. For balanced training we want to pick scans that contain all classes, so the strategy is: for each class, pick the class_n = num_centers // used_classes + 1 scans with the lowest potentials among the scans containing that class (here, 1000 // 10 + 1 = 101 scans per class). The while loop is only a fallback for the rare case where a class appears in fewer than class_n scans; it then repeats the scans of that class in random order until enough indices are gathered.

Now, when I coded this, I did not put a safeguard here, and in your case I think what happens is that class_potentials.shape[0] = 0, which means you have one class that is not present in any of the scans. You can verify that by printing this shape.

If this is the case, I suggest you check your data again. And if there is nothing you can do about it, just get around this while loop by not choosing anything for a class that is not present, as in the sketch below.
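For example, here is a minimal standalone sketch of that workaround (the per-class loop and the variable names are hypothetical, adapted from the snippet you posted):

import torch

# Hypothetical per-class potentials; class 1 appears in no scan,
# mirroring the failure case (an empty potentials tensor).
all_class_potentials = [torch.rand(8), torch.rand(0), torch.rand(5)]
class_n = 3

selected = []
for i, class_potentials in enumerate(all_class_potentials):
    if class_potentials.shape[0] == 0:
        continue  # class present in no scan: pick nothing for it
    n = min(class_n, class_potentials.shape[0])
    _, class_indices = torch.topk(class_potentials, n, largest=False)
    selected.append(class_indices)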

fhartmann17 commented 4 years ago

Thanks for the quick and detailed answer, @HuguesTHOMAS !!

You are right: class_potentials.shape[0] = 0.

The problem comes from the test_dataset, where I set balance_classes = True. But I still don't understand why this error appears: it says self.dataset.class_frames[2] = tensor([], dtype=torch.int64) (my class 02 is motorcycles), even though there are scans in the test and validation sets that contain motorcycles.

HuguesTHOMAS commented 4 years ago

If your dataset is based on the implementation of the SemanticKitti dataset, the code does not load labels for the test set, because it is not supposed to know them.

You thus have two choices. Either you change your sets: use your current training + validation as the new training set and use the test set as validation. This is easy to do and makes sense if you have the labels of the test scenes.

Or you search the code for every statement of the form

if self.set == 'test'

and modify the code wherever labels are involved.
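For illustration, the pattern to look for resembles the following (a hypothetical simplification, not a verbatim excerpt from the loader):

import numpy as np

def load_labels(split, label_file, num_points):
    # Hypothetical sketch of the guard pattern: at test time no
    # annotations exist, so the loader substitutes zero labels.
    if split == 'test':
        return np.zeros((num_points,), dtype=np.int32)
    # SemanticKITTI packs instance ids in the upper 16 bits of each label
    return (np.fromfile(label_file, dtype=np.uint32) & 0xFFFF).astype(np.int32)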

fhartmann17 commented 4 years ago

Thanks, @HuguesTHOMAS. You are right about that; I will take a look.

Another error happens if I use balance_classes=False and enter the else branch at lines 817/818 (here: SemanticKitti.py):

gen_indices = torch.randperm(self.dataset.potentials.shape[0])

then the tensor sizes of gen_indices and self.dataset.epoch_inds don't match at line 825:

self.dataset.epoch_inds += gen_indices

(self.dataset.epoch_inds is larger than gen_indices)

Do you know how to solve that?

HuguesTHOMAS commented 4 years ago

OK, the problem with epoch_inds is that it has to be accessible to all the workers of the dataloader, so we have to share this tensor, which is done here:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/94b1d87e17ec339e2aad049ce5d5d175b9cd0db1/datasets/SemanticKitti.py#L185

Once shared, you cannot change the tensor size, only the data in it, which is not very convenient. This is why I do

self.dataset.epoch_inds *= 0
...
self.dataset.epoch_inds += gen_indices

instead of a simple

self.dataset.epoch_inds = gen_indices
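To see why the in-place version matters, here is a minimal standalone sketch (not the repo's code) of a shared tensor updated by a worker process:

import torch
import torch.multiprocessing as mp

def worker(t):
    # In-place ops write into the shared storage seen by the parent;
    # rebinding (t = torch.arange(4)) would only change the local name.
    t *= 0
    t += torch.arange(4)

if __name__ == '__main__':
    epoch_inds = torch.zeros(4, dtype=torch.int64)
    epoch_inds.share_memory_()  # same storage is now visible to child processes
    p = mp.Process(target=worker, args=(epoch_inds,))
    p.start()
    p.join()
    print(epoch_inds)  # tensor([0, 1, 2, 3])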

Now you have many ways to solve this. A simple one is to append some random indices to the end of gen_indices so that it has the same size as self.dataset.epoch_inds, as in the sketch below.
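A standalone sketch of that padding fix (the sizes here are hypothetical; in the sampler, gen_indices and epoch_inds come from the dataset):

import torch

num_frames = 10                                   # hypothetical number of scans
epoch_inds = torch.zeros(16, dtype=torch.int64)   # shared buffer with a fixed size

# randperm alone is too short, so complete it with extra random indices
gen_indices = torch.randperm(num_frames)
num_missing = epoch_inds.shape[0] - gen_indices.shape[0]
if num_missing > 0:
    extra = torch.randint(0, num_frames, (num_missing,))
    gen_indices = torch.cat((gen_indices, extra), dim=0)

epoch_inds *= 0
epoch_inds += gen_indices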

You can also reduce the size of self.dataset.epoch_inds, which is controlled by the parameter config.epoch_steps for the training set and config.validation_size for the validation and test sets. See here:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/94b1d87e17ec339e2aad049ce5d5d175b9cd0db1/datasets/SemanticKitti.py#L176-L186
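If I remember the linked lines correctly, the allocation looks roughly like this (treat the exact names and the 1.1 safety factor as assumptions from memory, not a verbatim excerpt):

# Rough paraphrase of the linked allocation: the shared buffer is sized
# once per run, so its length fixes how many indices each epoch can hold.
if self.set == 'training':
    N = int(np.ceil(config.epoch_steps * self.batch_num * 1.1))
else:
    N = int(np.ceil(config.validation_size * self.batch_num * 1.1))
self.epoch_inds = torch.from_numpy(np.zeros((N,), dtype=np.int64))
self.epoch_inds.share_memory_()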

fhartmann17 commented 4 years ago

Thanks for your help @HuguesTHOMAS !!

I will close this issue now. Stay safe and good luck.