Closed: yannbouteiller closed this issue 3 years ago
Thanks for your interest! These are good questions.
"But isn't gaussian_filter1d supposed to do this (with the full window as the kernel) in the first place?"
Implementation-wise, this is true. In our case, we explicitly define a kernel with two parameters, kernel_size and sigma. I feel this is clearer because it decouples kernel_size and sigma: if you directly apply gaussian_filter1d on the label distribution, you can only tune the sigma (you might get similar results, but you lose the explicit "cut off" at the kernel boundary). That said, I do agree you can directly use gaussian_filter1d for the implementation!
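For concreteness, here is a minimal sketch of this scheme (the helper name, the bin counts, and the parameter values are illustrative, not the exact code from utils.py): an explicit truncated Gaussian window parameterized by kernel_size and sigma, convolved with the empirical label distribution, next to the single-parameter gaussian_filter1d alternative.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter1d

def truncated_gaussian_window(kernel_size=5, sigma=2.0):
    # Smooth a unit impulse of length kernel_size: the result is a Gaussian
    # window that is explicitly cut off at the window boundary.
    half = (kernel_size - 1) // 2
    impulse = np.zeros(kernel_size)
    impulse[half] = 1.0
    window = gaussian_filter1d(impulse, sigma=sigma)
    return window / window.max()  # peak-normalized window

# Hypothetical empirical label distribution, e.g. sample counts per age bin.
label_counts = np.array([5000., 1200., 300., 40., 3., 1.])

# LDS: effective label density = empirical density convolved with the window.
window = truncated_gaussian_window(kernel_size=5, sigma=2.0)
effective_counts = convolve1d(label_counts, weights=window, mode='constant')

# Single-parameter alternative: gaussian_filter1d applied directly; `truncate`
# (in units of sigma) is then the only handle on where the kernel is cut off.
effective_direct = gaussian_filter1d(label_counts, sigma=2.0, truncate=1.0)
```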
"it appears that you actually get better results with small kernels"
The choice of kernel size is related to the task of interest. For example, in age estimation, with a minimum bin size of 1, you would not expect a very large kernel size, since nearby ages are similar. Again, decoupling kernel_size and sigma might also be useful here for choosing a good set of parameters. In Appendix E.3 of our paper, we studied several choices of kernel_size and sigma, which might give you a sense of what values work well for different tasks.
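If it helps, a small sweep like the following (purely illustrative values, not taken from the paper or the repository) is one way to compare settings, using gaussian_filter1d's truncate argument to reproduce the cut-off at a given window size:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical per-bin sample counts (e.g. ages 0-100); substitute your own labels.
rng = np.random.default_rng(0)
label_counts = rng.integers(0, 200, size=101).astype(float)

for kernel_size in (3, 5, 9):
    for sigma in (1.0, 2.0):
        half = (kernel_size - 1) // 2
        # truncate is measured in multiples of sigma, so half / sigma cuts the
        # kernel off at the desired half-window.
        smoothed = gaussian_filter1d(label_counts, sigma=sigma, truncate=half / sigma)
        # Inspect how strongly rare bins are boosted before committing to a setting.
        print(kernel_size, sigma, smoothed.min().round(1), smoothed.max().round(1))
```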
"about this reweighting by the inverse sqrt"
Actually, we use SQRT-INV by default for certain tasks (like age estimation). The details of these baselines can be found on Page 6 of the paper. Both SQRT-INV and INV belong to the category of cost-sensitive re-weighting methods. The reason to use the square-root inverse sometimes is that, after plain inverse re-weighting, some weights can become very large (e.g., with 5,000 images for age 30 and only 1 image for age 100, the weight ratio after inverse re-weighting would be 5,000:1). This could cause optimization problems. Again, the choice also depends on the task you are tackling.
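As a rough illustration of the difference (a sketch with made-up counts, not the repository's exact weighting code):

```python
import numpy as np

# Hypothetical (smoothed) bin counts: many samples at age 30, a single one at age 100.
effective_counts = np.array([5000., 1200., 300., 40., 1.])

# INV: weights proportional to 1 / count -- the rarest bin dominates (ratio 5000:1 here).
inv_weights = 1.0 / effective_counts

# SQRT-INV: weights proportional to 1 / sqrt(count) -- the ratio shrinks to about 70:1,
# which is typically much easier to optimize.
sqrt_inv_weights = 1.0 / np.sqrt(effective_counts)

# Normalize so the mean weight is 1, keeping the overall loss scale unchanged.
inv_weights *= len(inv_weights) / inv_weights.sum()
sqrt_inv_weights *= len(sqrt_inv_weights) / sqrt_inv_weights.sum()
```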
Let me know if you have any questions!
Thank you for your answer, I see the point of using the inverse sqrt now!
The reason why using "cut" Gaussian kernels seems to perform better than using full Gaussians with small sigmas is still unclear to me, but I guess it just works. As far as I understand, this kernel is essentially an empirical way of smoothing the label distribution and making it more correlated with the empirical error, right? That yields better weights for rescaling the loss, for reasons I am not well-versed enough in the imbalanced-dataset literature to really grasp, but it sounds interesting.
Thanks for your work!
Hi, congratulations on your ICML paper, it sounds very useful and I loved the insight of Figure 2. I am trying to implement the paper right now in one of my projects. I have a couple of questions regarding LDS, if you don't mind me asking here.
First, I am a bit puzzled by this line in your code:
https://github.com/YyzHarry/imbalanced-regression/blob/b7fa50293b1e29d20d4b4e1136c5c1d8c39c60fa/agedb-dir/utils.py#L115
If I understand correctly, you are using gaussian_filter1d to create a Gaussian kernel of small size (e.g. 5 in the paper) and then convolve it with the label distribution using convolve1d. But isn't gaussian_filter1d supposed to do this (with the full window as the kernel) in the first place? Looking on the Internet, I find that the reason people use small Gaussian kernels in e.g. image processing is usually computational: beyond a width of about 3 standard deviations, a larger kernel would be useless. However, in the paper, it appears that you actually get better results with small kernels? Could you elaborate on this a little bit, please?

My second question is about this line:
https://github.com/YyzHarry/imbalanced-regression/blob/b7fa50293b1e29d20d4b4e1136c5c1d8c39c60fa/agedb-dir/datasets.py#L65
In the paper, I found a place where you talk about reweighting the loss by something proportional to the inverse of the smoothed label distribution (Algorithm 1 in the appendix), but nothing about this reweighting by the inverse sqrt as you seem to be doing here by default. Could you also elaborate a bit on this, please?
Thank you for your time!