cvg / LightGlue

LightGlue: Local Feature Matching at Light Speed (ICCV 2023)
Apache License 2.0

Some confusion about this work #65

Closed Master-cai closed 1 year ago

Master-cai commented 1 year ago

Hi! Thanks for your great work! However, I don't quite understand some parts of the code.

About the training and results

  1. How much does the pre-training affect the final performance? Have you tried training LightGlue directly on MegaDepth (i.e., no pre-training)?

  2. As you state in C.1 Architecture, Confidence classifier: "... and its gradients are not propagated into the states to avoid impacting the matching accuracy." Why would the classifier's gradients impact the matching accuracy? Have you run experiments on this?

  3. Is the confidence classifier trained after the pre-training or after the fine-tuning on MegaDepth?

Some details in the code

  4. Why is an additional row/column added to the log assignment matrix? I don't see it in the paper, and it doesn't seem to be used in this code either (code).

  5. Why do you assert `mask is None` when using the self-installed `FlashCrossAttention()`? (`FlashCrossAttention` also supports a `mask` parameter.) (code)

  6. In `CrossBlock`, when flash is enabled, you do not actually use Bidirectional Cross Attention, right? (code) If so, what is the reason?

  7. Why not compute `scores1` as `scores1 = F.log_softmax(sim, 1)` instead of the more complicated `scores1 = F.log_softmax(sim.transpose(-1, -2).contiguous(), 2).transpose(-1, -2)`? (code)

    Thanks for your time! Looking forward to your reply!

Phil26AT commented 1 year ago

Hi @Master-cai, thank you for opening this issue!

  1. It certainly helps a bit on out-of-distribution benchmarks (like HPatches), but less so on MegaDepth-1500.
  2. The network could learn to focus more on the confidence prediction than on the matching accuracy, which is ultimately what we care about. This could be solved with sophisticated loss balancing, but we decided to avoid that tedious tuning by simply not propagating gradients, which we found to be sufficiently accurate as-is; see the first sketch after this list.
  3. We trained the confidence classifier on both tasks, but I expect that just training it on MegaDepth is sufficient. Note that training this classifier is self-supervised, so it can be fine-tuned on any set of image pairs without ground truth labels.
  4. Indeed, but this is just a difference in notation. We store the non-matchability in the extra row/column. While it is not required during inference, it is required during training (Eq. 11); see the second sketch after this list.
  5. This is indeed not required. We will update the code, thank you for this hint!
  6. The problem is that a bidirectional flash attention kernel does not exist, and running flash attention twice (with shared Q/K) is still faster than a regular bidirectional attention implementation. We tried to implement bidirectional flash attention with Triton but have not yet achieved significant speedups. However, we do believe there is a big opportunity for speedups there! A sketch of the two-pass scheme is given after this list.
  7. If I remember correctly, this variant is slightly faster (fewer strided memory accesses in the softmax). I agree that they are mathematically equivalent; the last sketch below checks this numerically.
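
To make point 2 concrete, here is a minimal sketch of the gradient-stopping idea; `confidence_mlp` and `state` are illustrative names, not the exact modules in this repository:

```python
import torch
import torch.nn as nn

dim = 256
# Hypothetical confidence head; the real classifier may differ.
confidence_mlp = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

# Stand-in for the per-keypoint states at some layer.
state = torch.randn(1, 512, dim, requires_grad=True)

# Detach the states before the classifier: the confidence loss can still
# train the head, but no gradient flows back into the matching features,
# so the matching accuracy is unaffected.
confidence = confidence_mlp(state.detach())
```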
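For point 4, a rough sketch of how the non-matchability can live in the extra row/column of an (M+1) × (N+1) log-assignment matrix; the function name and the exact composition of the inner block are illustrative rather than the verbatim implementation:

```python
import torch
import torch.nn.functional as F

def padded_log_assignment(sim, z0, z1):
    """Build an (M+1) x (N+1) log-assignment matrix (sketch).

    sim: (B, M, N) similarity matrix; z0: (B, M) and z1: (B, N)
    matchability logits.
    """
    b, m, n = sim.shape
    scores = sim.new_zeros((b, m + 1, n + 1))
    # Inner M x N block: dual-softmax scores weighted by matchability.
    scores[:, :m, :n] = (
        F.log_softmax(sim, 2)
        + F.log_softmax(sim.transpose(-1, -2).contiguous(), 2).transpose(-1, -2)
        + F.logsigmoid(z0).unsqueeze(2)
        + F.logsigmoid(z1).unsqueeze(1)
    )
    # Extra column/row: log-probability that a point is unmatchable.
    # Not needed at inference time, but consumed by the training loss (Eq. 11).
    scores[:, :m, n] = F.logsigmoid(-z0)
    scores[:, m, :n] = F.logsigmoid(-z1)
    return scores
```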
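For point 6, a simplified single-head sketch of what "running flash attention twice with shared Q/K" means; the projection names are hypothetical and the real `CrossBlock` is multi-headed and more involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_attention_two_passes(x0, x1, to_qk, to_v, to_out):
    # Shared query/key projection for both images.
    qk0, qk1 = to_qk(x0), to_qk(x1)
    v0, v1 = to_v(x0), to_v(x1)
    # Two unidirectional passes; each call can dispatch to a flash kernel
    # when one is available.
    m0 = F.scaled_dot_product_attention(qk0, qk1, v1)  # image 0 attends to image 1
    m1 = F.scaled_dot_product_attention(qk1, qk0, v0)  # image 1 attends to image 0
    return to_out(m0), to_out(m1)

d = 64
to_qk, to_v, to_out = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
x0, x1 = torch.randn(1, 512, d), torch.randn(1, 400, d)
m0, m1 = cross_attention_two_passes(x0, x1, to_qk, to_v, to_out)
```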
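And for point 7, a quick numerical check that the two formulations of `scores1` are equivalent (`sim` is just a random stand-in):

```python
import torch
import torch.nn.functional as F

sim = torch.randn(2, 5, 7)  # (B, M, N) stand-in similarity matrix

# Straightforward log-softmax over the keypoints of image 0.
a = F.log_softmax(sim, 1)

# The variant used in the code: make the reduced dimension contiguous
# first, softmax over the last dim, then transpose back.
b = F.log_softmax(sim.transpose(-1, -2).contiguous(), 2).transpose(-1, -2)

print(torch.allclose(a, b, atol=1e-6))  # True (up to numerical noise)
```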
Master-cai commented 1 year ago

@Phil26AT Thank you for your quick and patient response!