Hi @LPXTT,
Initially, we removed the downsampling layers from the VGG architecture in order to get direct pixel-accurate descriptors and keypoints (instead of relying on bilinear upsampling + cell-based softmax). This does increase our FLOPs, and it is an opportunity to try more efficient architectures.
It should be fairly straightforward to use the SuperPoint backbone we implemented. One caveat though is that you will have to obtain the keypoint scores using a sigmoid instead of a softmax (which is the original SuperPoint "cell-based" design).
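To illustrate that caveat, here is a minimal sketch contrasting SuperPoint's cell-based softmax scores with the per-pixel sigmoid scores SiLK expects (the shapes and the 8x8 cell size are assumptions on my part, not taken from the actual code):

```python
import torch
import torch.nn.functional as F

# SuperPoint-style detector head: 65 logits per 8x8 cell
# (64 spatial bins + 1 "dustbin"), turned into probabilities
# with a softmax *within each cell*.
cell_logits = torch.randn(1, 65, 30, 40)             # (B, 65, H/8, W/8)
cell_probs = F.softmax(cell_logits, dim=1)[:, :-1]   # (B, 64, H/8, W/8)

# SiLK-style head: one logit per pixel, turned into an independent
# keypoint probability with a sigmoid (no cells, no dustbin).
pixel_logits = torch.randn(1, 1, 240, 320)           # (B, 1, H, W)
pixel_probs = torch.sigmoid(pixel_logits)            # (B, 1, H, W)
```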
Please, don't hesitate to ask if you have technical questions regarding this change.
Hi Gleize, thanks for your quick reply! I've got a quick question about adding max-pooling layers. In your original code, the output from the backbone is a bit smaller than the input, like 200 vs 184, because of the kernel size and no padding, right? Now, if I switch to SuperPoint's backbone, the output size becomes a quarter of the input size. Do I need to tweak the post-processing or keypoint matching parts because of this size change? Could you point me to which files might need adjustments? Thanks a ton!
Hi @LPXTT,
In your original code, the output from the backbone is a bit smaller than the input, like 200 vs 184, because of the kernel size and no padding, right?
Yes.
Do I need to tweak the post-processing or keypoint matching parts because of this size change? Could you point me to which files might need adjustments?
Sure.
First, the output descriptors should be at the same resolution as the input, since they are upsampled here.
However, because the name of the descriptor flow node changes between SuperPoint and SiLK (normalized_descriptors vs upsampled_descriptors), you need to point to the correct node name here (simply replace normalized_descriptors with upsampled_descriptors). The node transition is defined here.
Second, the linear mapping will be incorrect, and not retrievable in the case of SuperPoint. However, since the resolution shouldn't change, you can set it to an identity mapping here:
coord_mapping = silk.backbone.silk.coords.Identity()
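For intuition, here is a rough conceptual sketch of what such a coordinate mapping does. This is not the actual silk.backbone.silk.coords API (names and signatures are illustrative); it only shows why the identity is correct once the output is already at input resolution:

```python
# Conceptual sketch only; not the real silk.backbone.silk.coords classes.

class LinearCoordinateMapping:
    # With a padding-free conv stack, output position i maps back to an
    # input position scale * i + offset (scale/offset depend on the layers).
    def __init__(self, scale: float, offset: float):
        self.scale = scale
        self.offset = offset

    def apply(self, coords):
        return coords * self.scale + self.offset


class Identity:
    # When scores/descriptors are already upsampled to the input
    # resolution, output coordinates are input coordinates.
    def apply(self, coords):
        return coords
```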
You need to swap the first two transitions here and replace them with something like this:
flow.define_transition(
    f"{prefix}spatial_logits",
    partial(depth_to_space, cell_size=cell_size),
    detector_head_output_name,
)
flow.define_transition(
    f"{prefix}score",
    logits_to_prob,
    f"{prefix}spatial_logits",
)
This will spatially flatten the cell-based logits, then run a sigmoid on them to get the score (unlike SuperPoint, which uses a softmax inside each cell).
And finally, you will have to point to the newly created logit node by changing "logits" to "spatial_logits" here.
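Since depth_to_space and logits_to_prob are not shown in this thread, here is a self-contained sketch of what they could look like, plus a quick shape check. SiLK ships its own versions; in particular, how the dustbin channel is handled below is my assumption:

```python
import torch
import torch.nn.functional as F
from functools import partial

def depth_to_space(cell_logits, cell_size):
    # (B, cell_size**2 + 1, H/c, W/c) -> (B, 1, H, W): drop the dustbin
    # channel, then rearrange each cell's logits back to pixel positions.
    return F.pixel_shuffle(cell_logits[:, :-1], cell_size)

def logits_to_prob(logits):
    # Per-pixel sigmoid instead of SuperPoint's per-cell softmax.
    return torch.sigmoid(logits)

# Shape check with a 384x576 input and 8x8 cells.
cell_logits = torch.randn(1, 65, 384 // 8, 576 // 8)
score = logits_to_prob(partial(depth_to_space, cell_size=8)(cell_logits))
print(score.shape)  # torch.Size([1, 1, 384, 576])
```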
I haven't tried those changes, so I might have forgotten something, but at least it should unblock you a bit.
Thanks! It is really helpful!
@gleize Hi, I have another problem. The training loss became NaN at about epoch 9. I found there is a hyperparameter named block_size with a default value of 5400. Do I need to change it because of the resolution change (my input image's size is 384*576)? Thanks!
Hi @LPXTT,
The block_size is just a parameter used to make the large similarity matrix fit in memory (it is processed in blocks of fixed size). It should not affect the results at all, only the training speed and VRAM used.
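To make that concrete, here is a toy block-wise reduction (not SiLK's actual loss; it assumes simple dot-product similarities) showing that the block size changes how much is held in memory at once, but not the result:

```python
import torch

def rowwise_max_similarity(desc_0, desc_1, block_size):
    # Process desc_0 in blocks of `block_size` rows; each block's
    # (block_size x N1) similarity slice is reduced immediately, so the
    # full (N0 x N1) matrix never has to live in memory at once.
    maxima = []
    for start in range(0, desc_0.shape[0], block_size):
        sim_block = desc_0[start:start + block_size] @ desc_1.T
        maxima.append(sim_block.max(dim=1).values)
    return torch.cat(maxima)

d0, d1 = torch.randn(1000, 128), torch.randn(1200, 128)
small = rowwise_max_similarity(d0, d1, block_size=100)
large = rowwise_max_similarity(d0, d1, block_size=500)
assert torch.allclose(small, large)  # block_size doesn't change the result
```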
We've never had a NaN value during our trainings, so I suspect this could be a bug coming from the recent changes you've made. Do you know which layer causes this? You could monitor the min/max values of the gradients at each layer over time and try to identify when and where things go wrong.
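If your training loop is standard PyTorch, something along these lines can help locate the first layer whose gradients misbehave (a hedged sketch; model and step are whatever your loop uses, and in a Lightning-style setup a hook such as on_after_backward would be a natural place for it):

```python
import torch

def log_grad_stats(model, step):
    # Call right after loss.backward(): print min/max gradient values per
    # layer and flag the first non-finite (NaN/Inf) gradient that appears.
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        g = param.grad
        if not torch.isfinite(g).all():
            print(f"step {step}: non-finite gradient in {name}")
        print(f"step {step} {name}: min={g.min().item():.3e} max={g.max().item():.3e}")
```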
My input image's size is 384*576
Just to confirm: the resolution you mention is the resolution right before the random crop, right?
The resolution is after the random crop. I changed [164, 164] to [384, 576] in train-silk.yaml. I have checked the input size of the backbone; it is [384, 576].
Could you revert to [164, 164] and try again? Just to identify whether the issue you're having is caused by the resolution change or not.
Also, did you check that the features and logits you feed to the loss are correct (proper shape, etc.)?
Hi Gleize, I trained models with your default settings before and they worked fine, so I think the problem comes from the backbone and post-processing changes. I need to locate the issue, but I have one problem: I am not familiar with JAX, which I saw you use in your code. How can I check the value of the loss at return loss_0.mean(), loss_1.mean(), precision.mean(), recall.mean()? I tried to print loss_0 and it is a ShapedArray; I don't know how to see the value.
That JAX function is JIT compiled. That's why you get a ShapedArray with no value in it.
You can disable the JIT globally (cf. here), which is convenient for debugging.
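For reference, a minimal example of both ways to turn off the JIT (the loss function here is a stand-in, not SiLK's actual loss):

```python
import jax
import jax.numpy as jnp

# Option 1: disable JIT for the whole process.
jax.config.update("jax_disable_jit", True)

# Option 2: disable it only around the call you want to inspect.
@jax.jit
def toy_loss(x):
    print(x)  # under JIT this prints a tracer; with JIT disabled, real values
    return (x ** 2).mean()

with jax.disable_jit():
    print(toy_loss(jnp.arange(4.0)))  # prints [0. 1. 2. 3.] then 3.5
```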
Thanks!
Hi Gleize, in the event that my training session is interrupted and I need to resume, do I need to make any modifications to the configuration file, like train-defaults.yaml? I have set continue_from_checkpoint to True and pointed the model parameter to the path of the latest checkpoint. Is this the correct procedure for continuation? Thanks!
BTW, is there any strategy to speed up the training process? After changing the backbone and using my training data, it takes over 9 days to train 100 epochs.
Hi @LPXTT,
I have set continue_from_checkpoint to True and pointed the model parameter to the path of the latest checkpoint. Is this the correct procedure for continuation?
No, this is incorrect. You need to set the continue_from_checkpoint parameter to the checkpoint file you want to load, not True. If you search for continue_from_checkpoint in the codebase, you will find the part of the code that does the loading here.
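In other words, the relevant line in your config (e.g. train-defaults.yaml) should look something like the following, with a hypothetical path standing in for your actual checkpoint file:

```yaml
# point this at the checkpoint file itself, not a boolean
continue_from_checkpoint: /path/to/checkpoints/last.ckpt
```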
BTW, is there any strategy to speed up the training process? After changing the backbone and using my training data, it takes over 9 days to train 100 epochs.
The most likely explanation for the slowdown is the large input resolution you're using. In the paper, we show the training time for different resolutions, and the default resolution we use trains in about 5 hours.
If your GPU has enough memory, you could try to increase the block_size parameter. Increasing it will decrease the number of iterations, but will require more GPU memory (which might blow up).
Another alternative is to use a smaller image resolution. We've shown that the performance saturates as the input resolution increases.
Got it! Thanks!
Hey, I've been messing around with your method for my project, and it's been pretty cool so far. Quick thing though: I'm kinda stuck with this FLOP situation. Even with just one layer in the models, the FLOPs are way higher than SuperPoint's. I'm thinking it's probably because we're not doing any downsampling. Does that make sense? Have you tried using SuperPoint's backbone, the VGG-4, with your method? I'm really curious about how that went if you did. I'm thinking of switching to SuperPoint's backbone to cut down on the FLOPs. I saw some SuperPoint code in there, and I'm wondering if it's easy to switch them out? Would love to hear your thoughts when you get a chance. Thanks a ton!