eth-easl / modyn

Modyn is a research-platform for training ML models on dynamic datasets.

Potentially wrong gradient matrix #505

Closed. MaxiBoether closed this issue 1 day ago.

MaxiBoether commented 3 weeks ago

This is a duplicate of https://github.com/PatrickZH/DeepCore/issues/13.

In our AbstractMatrixDownsamplingStrategy implementation, we use the same matrix as defined by DeepCore. As discussed in that issue, the definition of the matrix looks weird, and I am also confused that we don't even need the actual gradient (everything is computed with torch.no_grad()). We should check whether this actually makes sense.
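For what it's worth: assuming a softmax classifier trained with cross-entropy (the usual DeepCore setting), the gradient of the loss with respect to the last-layer logits has the closed form softmax(z) - one_hot(y), so the last-layer gradient can be obtained from the forward pass alone. That would explain why no backward call (and hence no autograd) is needed, although it does not settle whether the matrix itself is constructed correctly.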

XianzheMa commented 2 weeks ago

on it

XianzheMa commented 1 week ago

After the potential fix PR, https://github.com/eth-easl/modyn/pull/540, I realized there could be a reason for their way of calculating the gradients.

Essentially, we should use the full gradients to select points. But since full gradients are expensive to compute, researchers propose using the last-layer gradients to approximate the full gradients.

Since it is possible to approximate with the last-layer gradients, it is also possible to use the gradients of the second-to-last layer, i.e. the gradient of the loss with respect to the embedding. Since the mapping from the embedding to the output is linear, the second-to-last-layer gradient is essentially the last-layer gradient scaled by the embedding layer, which is what they did.
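To make the quantities concrete, here is a minimal sketch, assuming a final linear layer followed by cross-entropy loss; the tensor names (embedding, W, b, targets) are illustrative and not Modyn's actual variables. It shows the closed-form last-layer gradients that the DeepCore-style matrix is built from, and the gradient with respect to the embedding obtained via the chain rule through that linear layer:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch B, embedding dim D, C classes.
B, D, C = 4, 16, 3
embedding = torch.randn(B, D)   # penultimate-layer activations
W = torch.randn(C, D)           # weights of the final linear layer
b = torch.randn(C)              # bias of the final linear layer
targets = torch.randint(0, C, (B,))

with torch.no_grad():  # closed forms only, no autograd needed
    logits = embedding @ W.T + b

    # d loss / d logits for softmax + cross-entropy: softmax(z) - one_hot(y).
    # This equals the per-sample gradient w.r.t. the last-layer bias.
    bias_grad = F.softmax(logits, dim=1) - F.one_hot(targets, C).float()     # (B, C)

    # Gradient w.r.t. the last-layer weights: outer product with the embedding.
    weight_grad = bias_grad.unsqueeze(2) * embedding.unsqueeze(1)            # (B, C, D)

    # DeepCore-style matrix as described in this thread:
    # bias gradient concatenated with the weight gradient (bias grad * embedding).
    deepcore_matrix = torch.cat([bias_grad, weight_grad.flatten(1)], dim=1)  # (B, C + C*D)

    # Gradient w.r.t. the embedding itself (chain rule through the linear layer).
    embedding_grad = bias_grad @ W                                           # (B, D)
```

Note that the weight gradient is the bias gradient scaled by the embedding, whereas the gradient with respect to the embedding is the bias gradient multiplied by the layer weights; which of these the literature means by "the last-layer gradient" is essentially what is being debated here.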

MaxiBoether commented 1 week ago

I guess it's not fully clear to me yet:

1) From my understanding, the papers proposing the initial techniques indeed use the last-layer gradients.

2) I guess it's not fully clear why what they do is equivalent to scaling the last-layer gradient. For the bias, they take the last-layer gradients and concatenate that with the bias gradient * embedding? I don't immediately see how that is equivalent to taking d(last layer) / d(penultimate layer).

What is your proposal here? Should we keep the DeepCore implementation? Should we add a flag that allows switching between DeepCore and your PR? I am not sure I fully understand the issue. My gut feeling is that we should follow the literature, but I wouldn't have a problem with adding a flag to switch between both implementations.

XianzheMa commented 1 week ago

> I guess it's not fully clear to me yet:
>
> 1. From my understanding, the papers proposing the initial techniques indeed use the last-layer gradients.
> 2. I guess it's not fully clear why what they do is equivalent to scaling the last-layer gradient. For the bias, they take the last-layer gradients and concatenate that with the bias gradient * embedding? I don't immediately see how that is equivalent to taking d(last layer) / d(penultimate layer).
>
> What is your proposal here? Should we keep the DeepCore implementation? Should we add a flag that allows switching between DeepCore and your PR? I am not sure I fully understand the issue. My gut feeling is that we should follow the literature, but I wouldn't have a problem with adding a flag to switch between both implementations.

You are right, we should follow the original literature anyway. I have explicitly read the GradMatch paper and they use the last-layer gradient. I will check the other policies as well. Also, in CRAIG they did something similarly wrong: https://github.com/eth-easl/modyn/blob/bf96bfa546c829a4844f25e0143497deb9539e04/modyn/trainer_server/internal/trainer/remote_downsamplers/remote_craig_downsampling.py#L110

Will check the paper and make changes if the implementation is inconsistent with it! My proposal is yes, as you said, to follow the literature! Will continue finishing that PR.

XianzheMa commented 1 week ago

> I guess it's not fully clear to me yet:
>
> 1. From my understanding, the papers proposing the initial techniques indeed use the last-layer gradients.
> 2. I guess it's not fully clear why what they do is equivalent to scaling the last-layer gradient. For the bias, they take the last-layer gradients and concatenate that with the bias gradient * embedding? I don't immediately see how that is equivalent to taking d(last layer) / d(penultimate layer).
>
> What is your proposal here? Should we keep the DeepCore implementation? Should we add a flag that allows switching between DeepCore and your PR? I am not sure I fully understand the issue. My gut feeling is that we should follow the literature, but I wouldn't have a problem with adding a flag to switch between both implementations.

I introduced a flag to switch between the two forms. So the bias gradient * embedding is the penultimate-layer gradient. It is concatenated with the last-layer gradient to somehow "enrich the information". I didn't find a claim about this form in the original literature, but @francescodeaglio's thesis has a specific section justifying it:

[screenshot of the relevant section of the thesis]

So I regard it as reasonable.