Closed: yannbouteiller closed this issue 3 years ago
Thanks for your interest! These are good questions.
"But isn't gaussian_filter1d supposed to do this (with the full window as the kernel) in the first place?"
Implementation-wise, this is true. In our case, we explicitly define a kernel with two parameters, kernel_size and sigma. I feel this is clearer because it decouples kernel_size and sigma: if you directly apply gaussian_filter1d on the label distribution, you can only tune the sigma (you might get similar results, but you lose the explicit "cut off" at the kernel boundary). That said, I do agree you can directly use gaussian_filter1d for the implementation!
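For concreteness, here is a minimal sketch of this scheme (the helper name, the bin counts, and the parameter values are illustrative, not the exact code from utils.py): an explicit truncated Gaussian window parameterized by kernel_size and sigma, convolved with the empirical label distribution, next to the single-parameter gaussian_filter1d alternative.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter1d

def truncated_gaussian_window(kernel_size=5, sigma=2.0):
    # Smooth a unit impulse of length kernel_size: the result is a Gaussian
    # window that is explicitly cut off at the window boundary.
    half = (kernel_size - 1) // 2
    impulse = np.zeros(kernel_size)
    impulse[half] = 1.0
    window = gaussian_filter1d(impulse, sigma=sigma)
    return window / window.max()  # peak-normalized window

# Hypothetical empirical label distribution, e.g. sample counts per age bin.
label_counts = np.array([5000., 1200., 300., 40., 3., 1.])

# LDS: effective label density = empirical density convolved with the window.
window = truncated_gaussian_window(kernel_size=5, sigma=2.0)
effective_counts = convolve1d(label_counts, weights=window, mode='constant')

# Single-parameter alternative: gaussian_filter1d applied directly; `truncate`
# (in units of sigma) is then the only handle on where the kernel is cut off.
effective_direct = gaussian_filter1d(label_counts, sigma=2.0, truncate=1.0)
```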
"it appears that you actually get better results with small kernels"
The choice of kernel size is related to the task of interest. For example, in age estimation, with a minimum bin size of 1, you would not expect a very large kernel size, since nearby ages are similar. Again, decoupling kernel_size and sigma might also be useful here for choosing a good set of parameters. In Appendix E.3 of our paper, we studied several choices of kernel_size and sigma, which might give you a sense of what values work well for different tasks.
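If it helps, a small sweep like the following (purely illustrative values, not taken from the paper or the repository) is one way to compare settings, using gaussian_filter1d's truncate argument to reproduce the cut-off at a given window size:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical per-bin sample counts (e.g. ages 0-100); substitute your own labels.
rng = np.random.default_rng(0)
label_counts = rng.integers(0, 200, size=101).astype(float)

for kernel_size in (3, 5, 9):
    for sigma in (1.0, 2.0):
        half = (kernel_size - 1) // 2
        # truncate is measured in multiples of sigma, so half / sigma cuts the
        # kernel off at the desired half-window.
        smoothed = gaussian_filter1d(label_counts, sigma=sigma, truncate=half / sigma)
        # Inspect how strongly rare bins are boosted before committing to a setting.
        print(kernel_size, sigma, smoothed.min().round(1), smoothed.max().round(1))
```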
"about this reweighting by the inverse sqrt"
Actually, we use SQRT-INV by default for certain tasks (like age estimation). The details of these baselines can be found on Page 6 of the paper. Both SQRT-INV and INV belong to the category of cost-sensitive re-weighting methods. The reason to use the square-root inverse sometimes is that, after plain inverse re-weighting, some weights can become very large (e.g., with 5,000 images for age 30 and only 1 image for age 100, the weight ratio after inverse re-weighting would be 5,000:1). This could cause optimization problems. Again, the choice also depends on the task you are tackling.
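As a rough illustration of the difference (a sketch with made-up counts, not the repository's exact weighting code):

```python
import numpy as np

# Hypothetical (smoothed) bin counts: many samples at age 30, a single one at age 100.
effective_counts = np.array([5000., 1200., 300., 40., 1.])

# INV: weights proportional to 1 / count -- the rarest bin dominates (ratio 5000:1 here).
inv_weights = 1.0 / effective_counts

# SQRT-INV: weights proportional to 1 / sqrt(count) -- the ratio shrinks to about 70:1,
# which is typically much easier to optimize.
sqrt_inv_weights = 1.0 / np.sqrt(effective_counts)

# Normalize so the mean weight is 1, keeping the overall loss scale unchanged.
inv_weights *= len(inv_weights) / inv_weights.sum()
sqrt_inv_weights *= len(sqrt_inv_weights) / sqrt_inv_weights.sum()
```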
Let me know if you have any questions!
Thank you for your answer, I see the point of using the inverse sqrt now!
The reason why using "cut" Gaussian kernels seems to perform better than using full Gaussians with small sigmas is still unclear to me, but I guess it just works. As far as I understand, this kernel is essentially an empirical way of smoothing the label distribution and making it more correlated with the empirical error, right? That yields better weights for rescaling the loss, for reasons I am not well-versed enough in the imbalanced-dataset literature to really grasp, but it sounds interesting.
Thanks for your work!
Hi, congratulations on your ICML paper, it sounds very useful and I loved the insight of Figure 2. I am trying to implement the paper right now in one of my projects. I have a couple of questions regarding LDS, if you don't mind me asking here.
First, I am a bit puzzled by this line in your code:
https://github.com/YyzHarry/imbalanced-regression/blob/b7fa50293b1e29d20d4b4e1136c5c1d8c39c60fa/agedb-dir/utils.py#L115
If I understand correctly, you are using gaussian_filter1d to create a Gaussian kernel of small size (e.g. 5 in the paper) and then convolve it with the label distribution using convolve1d. But isn't gaussian_filter1d supposed to do this (with the full window as the kernel) in the first place? Looking on the Internet, I find that the reason people use small Gaussian kernels in e.g. image processing is usually computational: beyond a width of about 3 standard deviations, a larger kernel would be useless. However, in the paper, it appears that you actually get better results with small kernels? Could you elaborate on this a little bit, please?

My second question is about this line:
https://github.com/YyzHarry/imbalanced-regression/blob/b7fa50293b1e29d20d4b4e1136c5c1d8c39c60fa/agedb-dir/datasets.py#L65
In the paper, I found a place where you talk about reweighting the loss by something proportional to the inverse of the smoothed label distribution (Algorithm 1 in the appendix), but nothing about this reweighting by the inverse sqrt as you seem to be doing here by default. Could you also elaborate a bit on this, please?
Thank you for your time!