ClementPinard / FlowNetPytorch

Pytorch implementation of FlowNet by Dosovitskiy et al.
MIT License

Loss in Pytorch and in Caffe #20

Closed. tlkvstepan closed this issue 6 years ago.

tlkvstepan commented 6 years ago

@ClementPinard Did you cross-check the losses during training in Caffe and PyTorch? Are they similar? In my case (DispNetCorr1) the PyTorch losses are 100x larger than the Caffe ones. I think I use the same averaging over batches and pixels as in Caffe, but they are still different. The weird thing is that in the case of DispNetCorr1 the disparities are not normalized, i.e. they are in the range [0 ... 250], so I expect to get high losses at the beginning of training (on the order of 10-100), but they are still small in the Caffe log.

ClementPinard commented 6 years ago

I did not study DispNet, but cross-checking of the loss has been done in #17. Basically, the L1Loss layer in the Caffe implementation of FlowNet-DispNet is a custom one, with the option normalize_by_num_entries deactivated, which means that even if the flow is divided by 20, you'll end up with a very high loss if your image has a lot of pixels.
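
For intuition, here is a minimal PyTorch sketch (not code from this repo or from the Caffe layer; shapes and values are made up) showing how far apart a per-entry mean and a per-sample sum are on the same tensors:

import torch
import torch.nn.functional as F

# hypothetical prediction/target of shape (batch, channels, H, W)
pred = torch.zeros(8, 2, 96, 128)
target = torch.ones(8, 2, 96, 128)

loss_mean = F.l1_loss(pred, target, reduction='mean')               # per-entry average: 1.0
loss_sum = F.l1_loss(pred, target, reduction='sum') / pred.size(0)  # per-sample sum: 2*96*128 = 24576.0

print(loss_mean.item(), loss_sum.item())

The two differ by a factor of C*H*W, so the scale of the reported loss depends entirely on which normalization each implementation uses.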

But maybe for DispNet, this option is selected, hence the smaller loss. I'll go check the train prototxt.

ClementPinard commented 6 years ago

From the train.prototxt of DispNetCorr1D:

layer {
  name: "flow_loss6"
  type: "L1Loss"
  bottom: "blob31"
  bottom: "blob30"
  top: "flow_loss6"
  loss_weight: 0.2
  l1_loss_param {
    l2_per_location: false
    normalize_by_num_entries: true
  }
}

normalize_by_num_entries is selected, and loss_weight is 0.2, which means that even if you are completely wrong, the maximum loss will be around 0.2*255 (~50, and it will go down to less than 10 pretty quickly). Is that what you get?
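
As a quick sanity check of that bound (hypothetical numbers, not the actual training code): a mean-reduced L1 loss weighted by 0.2 on disparities in [0, 255] cannot exceed 0.2*255.

import torch
import torch.nn.functional as F

# worst case: ground truth at 255 everywhere, prediction at 0 everywhere
target = torch.full((4, 1, 96, 128), 255.0)
pred = torch.zeros_like(target)

loss = 0.2 * F.l1_loss(pred, target, reduction='mean')
print(loss.item())   # ~51, i.e. 0.2 * 255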

tlkvstepan commented 6 years ago

I think your conclusions are wrong. There are actually 6 losses for the different scales, and their total weight is equivalent to 2, so the loss should be much larger... After long searches and experiments I figured out that most probably the disparity is divided by 32 during training. It is done not in train.prototxt, but probably in custom_data_layer. Most notably, the original networks provided in /model seem to have been trained in a different way.

tlkvstepan commented 6 years ago

            case DataParameter_CHANNELENCODING_UINT16FLOW:
                for(int c=0; c<channel_count; c++)
                    for(int y=0; y<height; y++)
                        for(int x=0; x<width; x++)
                        {
                            // reassemble two consecutive bytes into a signed 16-bit value
                            short v;
                            *((unsigned char*)&v)=*(srcptr++);
                            *((unsigned char*)&v+1)=*(srcptr++);

                            Dtype value;
                            if(v==std::numeric_limits<short>::max()) {
                              // sentinel value -> invalid pixel
                              value = std::numeric_limits<Dtype>::signaling_NaN();
                            } else {
                              // stored values are 32x the real disparity/flow
                              value = ((Dtype)v)/32.0;
                            }

                            *(destptr++)=value;
                        }
                break;
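
For reference, a rough NumPy equivalent of that decoding (my own sketch, assuming a little-endian byte stream; not code from the Caffe repo):

import numpy as np

def decode_uint16flow(raw_bytes, channels, height, width):
    # reassemble pairs of bytes into signed 16-bit little-endian values
    v = np.frombuffer(raw_bytes, dtype='<i2', count=channels * height * width)
    out = v.astype(np.float32) / 32.0           # stored values are 32x the real disparity/flow
    out[v == np.iinfo(np.int16).max] = np.nan   # 32767 is the sentinel for invalid pixels
    return out.reshape(channels, height, width)
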
tlkvstepan commented 6 years ago

This is very confusing... The authors should not have placed this normalization in the data loader...

tlkvstepan commented 6 years ago

Hope this will be useful! I think the same data loader is used for flow as well.

ClementPinard commented 6 years ago

Well, I did not try DispNet at all, so I did not make any conclusions ¯\_(ツ)_/¯ I was just pointing out some ways for you to search for what could be different from FlowNet.

It's a shame that in the Caffe code the training setup is not consistent between the two networks, which are otherwise very similar. However, this division may occur here because the real disparity (consistent with pixel units) is not the raw downloaded one (while for the flow maps it was). The disparity coming out of the data loader would thus be the real one, because it was multiplied by 32 when the files were originally written.

I'd be glad to implement a DispNet training script when I have some more time, but searching for good hyperparameters for FlowNet training (which in the end are not the ones used by the original paper!) was really tedious, and I am not sure I'll be willing to do the same search ^^

tlkvstepan commented 6 years ago

Perhaps you are right, since you apparently did not experience a similar strange mismatch between the Caffe and PyTorch losses. I will recheck for the FlowNets. Thank you!
