We alternately minimize and maximize HSIC for f and g, respectively.

minimize_f HSIC(f, g)
=> This is what we really want. Note that in this step, only the parameters of f are updated.

maximize_g HSIC(f, g)
=> We also maximize HSIC for g, to make g catch up with f. Note that we let g have limited capacity compared to f, e.g., a smaller receptive field (rf).

Our minimax optimization scheme enforces that f is different from the worst-case g.
For example, assume that g can potentially have a 32x32 rf and f can have a 64x64 rf. Assume the current g only expresses a 3x3 rf (even though it has enough capacity to express a larger rf). In this case, minimize_f HSIC(f, g) alone may push f toward a smaller rf (e.g., 5x5) than we expect (64x64). However, if we apply maximize_g HSIC(f, g), g will enlarge its rf. At the same time, minimize_f HSIC(f, g) will make f de-biased from g.
Please see the paper and the video for the detailed explanation: https://youtu.be/lkjMxZDGubA?t=210
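To make the alternating scheme concrete, here is a rough, self-contained sketch (not the exact training code in this repository; the tiny models, the linear-kernel hsic, and the hyper-parameters are placeholders chosen only for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder components, only to make the sketch runnable.
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # stand-in for the large-rf network
g = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # stand-in for the small-rf (biased) network
clf_f, clf_g = nn.Linear(128, 10), nn.Linear(128, 10)

def hsic(a, b):
    # Biased empirical HSIC with linear kernels (illustration only).
    a, b = a - a.mean(0), b - b.mean(0)
    n = a.size(0)
    return ((a @ a.t()) * (b @ b.t())).sum() / (n - 1) ** 2

opt_f = torch.optim.Adam(list(f.parameters()) + list(clf_f.parameters()), lr=1e-3)
opt_g = torch.optim.Adam(list(g.parameters()) + list(clf_g.parameters()), lr=1e-3)
lambda_f = lambda_g = 1.0

x = torch.randn(16, 3, 32, 32)           # dummy batch
y = torch.randint(0, 10, (16,))

# Step 1: update f only -- minimize L(f) + HSIC(f, g), with g frozen.
feat_f = f(x)
with torch.no_grad():
    feat_g = g(x)
loss_f = F.cross_entropy(clf_f(feat_f), y) + lambda_f * hsic(feat_f, feat_g)
opt_f.zero_grad()
loss_f.backward()
opt_f.step()

# Step 2: update g only -- minimize L(g) while maximizing HSIC(f, g), with f frozen,
# so the limited-capacity g "catches up" with f as the worst-case biased network.
feat_g = g(x)
with torch.no_grad():
    feat_f = f(x)
loss_g = F.cross_entropy(clf_g(feat_g), y) - lambda_g * hsic(feat_f, feat_g)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

In practice these two steps are repeated over mini-batches, and the limited capacity of g comes from its architecture (e.g., a smaller receptive field), not from the toy linear models used here.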
Yes, it could be redundant. We assume that there could be more than one g (while in the paper, we only show results with a single g). Strictly speaking, this loop should return the list of g dimensions, but in practice, since we use the same architecture for all g networks, we only return a single dimension value.
@SanghyukChun Thanks a lot for your timely and detailed explanation!
If I understand correctly, the worst-case g means that g will reach its best capability for classification (a large rf, or a small loss L(g)), i.e., maximize_g HSIC(f, g) - L(g). At the same time, we also encourage f to be de-biased from such a g, i.e., we optimize f by minimize_f HSIC(f, g) + L(f), right?
If so, can we simplify this by optimizing f and g jointly? For example, min_{f, g} L(f) + L(g) + HSIC(f, g). What would be the potential drawback of this idea compared to your paper's method? If your method were not updated alternately, i.e., if f and g were jointly optimized with a single loss, the only difference would be the additional maximize_g HSIC(f, g) term that makes g catch up with f. So, if L(f) and L(g) are both minimized, why might g fail to catch up with f?
Forgive my long questions! Looking forward to your discussion. Thanks!
Let me rephrase my concern. Can we simplify the alternating updates into minimizing the single objective
L(f) + L(g) + HSIC(f, \fixed{g}) - HSIC(\fixed{f}, g)
where \fixed{f} or \fixed{g} means that the parameters of f or g are not updated by the corresponding HSIC term?
If this also works, this kind of ReBias variant may be even easier to use.
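In PyTorch-style pseudocode, the variant I have in mind would look roughly like this (just a sketch, reusing the placeholder definitions f, g, clf_f, clf_g, hsic, x, y and the imports from the sketch above; detach() plays the role of \fixed{.}):

```python
# Single joint update over f and g (hypothetical variant, not the paper's method).
params = (list(f.parameters()) + list(clf_f.parameters())
          + list(g.parameters()) + list(clf_g.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
lambda_ = 1.0

feat_f, feat_g = f(x), g(x)
loss = (
    F.cross_entropy(clf_f(feat_f), y)              # L(f)
    + F.cross_entropy(clf_g(feat_g), y)            # L(g)
    + lambda_ * hsic(feat_f, feat_g.detach())      # HSIC(f, \fixed{g}): gradients reach f only
    - lambda_ * hsic(feat_f.detach(), feat_g)      # -HSIC(\fixed{f}, g): gradients reach g only
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```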
@Cogito2012 In theory, you have to update alternately to correctly solve a minimax problem, which aims to find a saddle point.
If you jointly optimize the min and max problems at the same time, reaching a correct saddle point is not guaranteed. I am not even sure that such an optimization would converge at all.
Note that our conceptual objective function is
\min_f [L(f) + \max_g HSIC(f, g)]
which has to be solved by alternating updates for f and g, respectively.
Such minimax problems are common in many machine learning methods, such as GANs and adversarial training.
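Spelled out, each round of the alternation approximately solves the two subproblems below (I add a weighting hyper-parameter \lambda here for clarity; it is not written in the conceptual objective above):
\min_{\theta_f} L(f) + \lambda HSIC(f, g) (update f with g fixed)
\min_{\theta_g} L(g) - \lambda HSIC(f, g) (update g with f fixed)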
@SanghyukChun You're right, this is now clear to me.
In theory, a minimax optimization problem indeed needs an alternating update strategy to reach a correct saddle point. In practice, for large-scale deep neural networks, I think I'll still need to verify empirically whether the joint training strategy works or not. Anyway, your insightful reply really helped me a lot!
Thank you for such great work! When reading the paper and code, I have the following questions.
According to the definition of HSIC, it measures statistical dependence:
HSIC(U, V) = 0
indicates that U and V are independent, and a larger HSIC value indicates that U and V are dependent to some extent. So, to de-bias the representation of network f using a biased network g, shouldn't we minimize HSIC(f, g)?
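For reference, my understanding of the (biased) empirical HSIC estimator is the following sketch (RBF kernels with a median-heuristic bandwidth; this is just my own illustration, not the implementation in this repository):

```python
import torch

def rbf_kernel(x):
    # Pairwise RBF (Gaussian) kernel matrix with a median-heuristic bandwidth.
    sq_dist = torch.cdist(x, x) ** 2
    bandwidth = sq_dist[sq_dist > 0].median()
    return torch.exp(-sq_dist / bandwidth)

def hsic_biased(u, v):
    # Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2, where H centers the Gram matrices.
    # HSIC(U, V) = 0 iff U and V are independent (for characteristic kernels);
    # larger values indicate stronger dependence.
    n = u.size(0)
    k, l = rbf_kernel(u), rbf_kernel(v)
    h = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2

u = torch.randn(128, 16)
v = torch.randn(128, 16)
print(hsic_biased(u, v).item())   # independent features -> small value
print(hsic_biased(u, u).item())   # identical features  -> clearly larger value
```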
Besides, the for loop at line 62 seems redundant because g_dim will be overwritten by the last iteration, right? Looking forward to your reply. Thanks!