clovaai / rebias

Official Pytorch implementation of ReBias (Learning De-biased Representations with Biased Representations), ICML 2020

Why is the HSIC not minimized but maximized? #6

Closed Cogito2012 closed 3 years ago

Cogito2012 commented 3 years ago

Thank you for such great work! While reading the paper and the code, I had the following questions.

According to its definition, HSIC measures the degree of dependence: HSIC(U, V) = 0 indicates that U and V are independent, while a larger HSIC value indicates that U and V are more strongly dependent.

So, to debias the representation of network f by using a biased network g, shouldn't we minimize HSIC(f, g)?
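For reference, my understanding is that an empirical (biased) HSIC estimate with RBF kernels looks roughly like the sketch below (just for illustration; not necessarily how it is implemented in this repo):

```python
import torch

def rbf_kernel(x, sigma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared distances.
    sq_dists = torch.cdist(x, x, p=2) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def hsic_biased(u, v, sigma=1.0):
    # Biased empirical HSIC estimate between feature batches u, v of shape (n, d).
    # It is ~0 when u and v are independent and grows with their dependence.
    n = u.size(0)
    k, l = rbf_kernel(u, sigma), rbf_kernel(v, sigma)
    h = torch.eye(n, device=u.device) - torch.ones(n, n, device=u.device) / n  # centering matrix
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2
```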

Besides, the for loop at line 62 seems redundant, because g_dim is overwritten in every iteration and only the value from the last loop is kept, right?

Looking forward to your reply. Thanks!

SanghyukChun commented 3 years ago

We alternately minimize and maximize HSIC for f and g, respectively.

Note that we let g have limited capacity compared to f, e.g., a smaller receptive field (rf). Our minimax optimization scheme enforces that f is different from the worst-case g.

For example, assume that g can potentially have a 32x32 rf and f can have a 64x64 rf, and that the current g only expresses a 3x3 rf (even though it has enough capacity to express a larger one). In this case, only applying minimize_f HSIC(f, g) may leave f with a smaller rf (e.g., 5x5) than we expect (64x64). However, if we also apply maximize_g HSIC(f, g), g will enlarge its rf, and at the same time minimize_f HSIC(f, g) will de-bias f from this stronger g.

Please see the paper and the video for the detailed explanation: https://youtu.be/lkjMxZDGubA?t=210
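
Schematically, one alternating iteration could look like the sketch below (hypothetical interface in which each network returns `(logits, features)` and `hsic` is any differentiable HSIC estimate; this is not the exact trainer code in this repo):

```python
import torch.nn.functional as F

def alternating_step(f_net, g_net, opt_f, opt_g, hsic, x, y,
                     lambda_f=1.0, lambda_g=1.0):
    # 1) Update g: fit its own task loss while MAXIMIZING HSIC(f, g),
    #    pushing g towards the worst-case biased model w.r.t. the current f.
    _, feat_f = f_net(x)
    logits_g, feat_g = g_net(x)
    loss_g = F.cross_entropy(logits_g, y) - lambda_g * hsic(feat_f.detach(), feat_g)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2) Update f: fit its own task loss while MINIMIZING HSIC(f, g),
    #    making f statistically independent of the (now stronger) biased g.
    logits_f, feat_f = f_net(x)
    _, feat_g = g_net(x)
    loss_f = F.cross_entropy(logits_f, y) + lambda_f * hsic(feat_f, feat_g.detach())
    opt_f.zero_grad()
    loss_f.backward()
    opt_f.step()
```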


https://github.com/clovaai/rebias/blob/7b455fb8b9e5bcd67bf3a7c1812625b1f908ad55/criterions/sigma_utils.py#L62-L65

Yes, it could be redundant. We assume that there can be more than one g (although in the paper we only show results with a single g). Strictly speaking, the loop should return the list of g dimensions, but in practice, since we use the same architecture for all g networks, we only return a single dimension value.
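
For reference, a stricter version of that loop would collect the dimension of every g, roughly like this (hypothetical sketch; variable names differ from the actual code, and each g is assumed to return `(logits, features)`):

```python
# Collect the feature dimension of every g network instead of keeping only the last one.
g_dims = []
for g_net in g_nets:
    _, g_feats = g_net(x)
    g_dims.append(g_feats.size(1))
```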

Cogito2012 commented 3 years ago

@SanghyukChun Thanks a lot for your timely and detailed explanation!

If I understand correctly, the worst-case g means that g reaches its best classification capability within its limited capacity (large rf, or small loss L(g)) while staying maximally dependent on f, i.e., maximize_g HSIC(f, g) - L(g); at the same time, f is encouraged to be de-biased from such a g, i.e., f is optimized by minimize_f HSIC(f, g) + L(f), right?

If so, can we make it simpler by optimizing f and g jointly, e.g., min_{f,g} L(f) + L(g) + HSIC(f, g)? What is the potential drawback of this idea compared to your paper's method? If your method were not updated alternately, i.e., if f and g were jointly optimized with a single loss, the only difference would be the additional maximize_g HSIC(f, g) term, which makes g catch up with f. So, if L(f) and L(g) are both minimized, why might g fail to catch up with f?

Forgive the long questions! Looking forward to your thoughts. Thanks!

Cogito2012 commented 3 years ago

Let me rephrase my concern. Can we replace the alternating updates with minimizing the following single objective?

L(f) + L(g) + HSIC(f, \fixed{g}) - HSIC(\fixed{f}, g)

where \fixed{f} or \fixed{g} means that the parameters of f or g are not updated by the corresponding HSIC term (i.e., gradients are stopped).

If this also works, this ReBias variant might be even easier to use.
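
In code, I imagine the joint loss would look roughly like the sketch below, where `.detach()` plays the role of \fixed{.} (a stop-gradient) and `hsic` is any differentiable HSIC estimate (hypothetical interface, same as above):

```python
import torch.nn.functional as F

def joint_loss(f_net, g_net, hsic, x, y):
    # Both networks are assumed to return (logits, features).
    logits_f, feat_f = f_net(x)
    logits_g, feat_g = g_net(x)
    return (F.cross_entropy(logits_f, y)          # L(f)
            + F.cross_entropy(logits_g, y)        # L(g)
            + hsic(feat_f, feat_g.detach())       # HSIC(f, \fixed{g}): gradients reach f only
            - hsic(feat_f.detach(), feat_g))      # HSIC(\fixed{f}, g): gradients reach g only
```

A single optimizer over the parameters of both f and g would then minimize this loss in every step.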

SanghyukChun commented 3 years ago

@Cogito2012 In theory, you have to update alternately to correctly solve a minimax problem, which aims to find a saddle point. If you jointly optimize the min and max problems at the same time, there is no guarantee that you reach a correct saddle point; I am not even sure that such an optimization would converge at all.

Note that our conceptual objective function is

\min_f [L(f) + \max_g HSIC(f, g)]

which has to be solved by alternating updates of f and g, respectively.

Such minimax problems are common in many machine learning methods, e.g., GAN training and adversarial training.

Cogito2012 commented 3 years ago

@SanghyukChun You're right; this is clear to me now.

In theory, a minimax optimization problem indeed needs an alternating update strategy to reach a correct saddle point. In practice, for large-scale deep neural networks, I will still need to empirically verify whether the joint training strategy works. Anyway, your insightful reply really helped me a lot!