Hi.
nn.CrossEntropyLoss acts on the activation logits, doing the logarithm and the softmax for you. In my code organization, I do the softmax inside the Segmentator class and take the logarithm right before calling the loss function, so the behavior is the same when using nn.NLLLoss.
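As a minimal sketch of that equivalence (the shapes below are made up for illustration, not taken from the repo):

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 20, 64, 512)          # (batch, classes, H, W) activation logits
target = torch.randint(0, 20, (2, 64, 512))   # per-pixel class labels

# Option A: nn.CrossEntropyLoss applies log-softmax internally
loss_a = nn.CrossEntropyLoss()(logits, target)

# Option B: softmax in the model, log right before nn.NLLLoss (the organization described above)
probs = nn.Softmax(dim=1)(logits)
loss_b = nn.NLLLoss()(torch.log(probs.clamp(min=1e-8)), target)  # clamp only guards log(0)

assert torch.allclose(loss_a, loss_b, atol=1e-5)
```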
Let me know if this is clear, or I'll send you links to the lines when I'm at my computer.
Yes it's very clear to me! Thanks!
Hi.
I need to count the FLOPs of my network. Do you have any good ideas on how to do this?
Thanks.
I would try something like this: https://github.com/Lyken17/pytorch-OpCounter
I have tried it out, and it works fine. Here are the things to be noted.
First, `pip install thop`.
Then, in segmentator.py:
```python
# in segmentator.py (inside the Segmentator class, since `self` is used below)
import torch
from thop import profile, clever_format

input = torch.randn(1, 5, 64, 512)  # change 512 to the width of the input in the yaml files
device = torch.device("cuda")
input = input.to(device)
self.decoder.cuda()
self.head.cuda()
flops, params = profile(self, inputs=(input, ))
flops, params = clever_format([flops, params], "%.3f")
print("FLOPS: ", flops)
print("Total params: ", params)
```
And I ran into a question while doing this: in your code you move self.backbone to cuda, but you don't do the same for self.decoder or self.head. Why is that?
Hi,
The backbone is moved to cuda there because it is being profiled with a fake input to get the shape of the skip connections (needed by the decoder to define its internal structure). Since backbone, decoder, and head are nn.Module members of the Segmentator class, they all go to cuda when I do self.model.cuda() here. This is standard PyTorch behavior: when you call functions such as .train(), .eval(), .cuda(), .cpu(), etc., on a module, they are called on all of its children.
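A small sketch of that behavior (the module names are hypothetical, and it assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

class Segmentator(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(5, 32, 3, padding=1)
        self.decoder = nn.Conv2d(32, 32, 3, padding=1)
        self.head = nn.Conv2d(32, 20, 1)

model = Segmentator()
model.cuda()   # moves backbone, decoder, and head to the GPU
model.eval()   # sets every child module to eval mode as well
print(next(model.head.parameters()).device)  # cuda:0
```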
Hi,
Do you know what affects the inference time of the network? I find it puzzling that when I set the batch size to 48 (trained on 2 1080Ti), the inference fps of my own network was 96, and when I set it to 8, the fps was 60. I've also run your DarkNet53-512px, and its inference fps was 50. The number of parameters of DarkNet53 is 50M, while mine is 4M, yet our inference times differ little, which seems counterintuitive.
Hi,
Batching always helps the fps. Here is a good post from NVIDIA with an fps vs. batch size analysis (click on "Inference"). This is because each kernel launches once for the same layer across all images in the batch, rather than once per image, which pushes GPU utilization closer to 100%.
In terms of the number of parameters, fewer is not always better. The darknet backbone is specifically designed in the YOLO paper to maximize GPU utilization, by using simple operators that have very fast implementations and by avoiding long sequential chains of layers, which add dead times that hurt GPU utilization. Therefore the relationship between FLOPs and time, or parameters and time, is NEVER linear. This is not just for this framework, but for every GPU-based application (especially deep CNNs).
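If you want to see the batch-size effect yourself, here is a rough timing sketch (assuming `model` is any nn.Module and a CUDA device is available; it is not code from this repo):

```python
import time
import torch

def measure_fps(model, batch_size, n_iters=50, shape=(5, 64, 512)):
    model.cuda().eval()
    x = torch.randn(batch_size, *shape, device="cuda")
    with torch.no_grad():
        for _ in range(10):               # warm-up so lazy init/autotuning doesn't skew timing
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()
        elapsed = time.time() - start
    return n_iters * batch_size / elapsed  # images per second

# for bs in (1, 8, 48):
#     print(bs, measure_fps(model, bs))
```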
Hi,
Thanks a lot for all your enlightening replies; they help me a lot. I realized that my validation was run with a batch size bigger than 1, which makes the result reasonable. But in real life a LiDAR collects point clouds at about 10 Hz and the embedded device on a car processes the input one by one, so maybe it would be more reasonable to compare the fps at batch size 1.
Thanks again for your help these days.
Yes, a batch-1 comparison makes the most sense. For the paper, all the experiments are run with batch 1, using the inference script and adding traces to calculate the means and stds of the runtimes. This code is not in the repo since it's only needed to generate results for the paper.
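Something along these lines would reproduce that kind of batch-1 trace (a sketch only, not the paper's actual script; `model` is any nn.Module on a CUDA machine):

```python
import time
import torch

def per_frame_stats(model, n_frames=100, shape=(1, 5, 64, 512)):
    model.cuda().eval()
    times = []
    with torch.no_grad():
        for _ in range(n_frames):
            x = torch.randn(*shape, device="cuda")
            torch.cuda.synchronize()
            t0 = time.time()
            model(x)
            torch.cuda.synchronize()
            times.append(time.time() - t0)
    times = torch.tensor(times)
    return times.mean().item(), times.std().item()  # mean and std runtime per frame, in seconds
```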
Hi, in your paper it says that during training the network uses a weighted cross-entropy loss function. However, if I have not misread the code, it uses nn.NLLLoss instead of nn.CrossEntropyLoss (which combines nn.LogSoftmax() and nn.NLLLoss() in one single class). Why?