Closed Divadi closed 6 months ago
Hi, thanks for your interest and question!
@Nightmare-n Thank you so much for your earlier response. I think this is a strong work, so I've been trying to reproduce it! Unfortunately my results are not as good, and I wanted to get your guidance on some implementation details.
Specifically, I am trying to create this result:
Here is what I've tried.
First, I tried loading in ConvNeXt-S weights as in mmdetection, which are from the official repo. Training for 12 epochs on load_interval=2, it achieves 20.9/22.4 for NDS/mAP. I tried removing augmentations and got 22.7/25.1, which is still lower NDS than shown. I based my code on UVTR camera base, adding layer-wise LR decay.
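For anyone following along, here is a rough sketch of what I mean by layer-wise LR decay. It is only an illustration, not the authors' implementation: the parameter-name prefixes (`downsample_layers.*`, `stages.*`), the helper names, and the `base_lr`/`decay` values are assumptions chosen for the example, not the settings used in the runs above. The depths `(3, 3, 27, 3)` are the standard ConvNeXt-S block counts.

```python
from collections import defaultdict


def layer_id_for(name, depths=(3, 3, 27, 3)):
    """Hypothetical mapping from a parameter name to a depth index.

    Assumes ConvNeXt-style names such as 'downsample_layers.<stage>...'
    and 'stages.<stage>.<block>...'. Anything else (e.g. the detection
    head) gets the largest index, so it trains at the full base LR.
    """
    parts = name.split(".")
    if parts[0] == "downsample_layers":
        # The stem gets index 0; later downsample layers share the index
        # of the last block of the preceding stage.
        return sum(depths[: int(parts[1])])
    if parts[0] == "stages":
        stage, block = int(parts[1]), int(parts[2])
        return sum(depths[:stage]) + block + 1
    return sum(depths) + 1


def build_param_groups(named_params, base_lr=2e-4, decay=0.9,
                       depths=(3, 3, 27, 3)):
    """Group parameters by depth and scale the LR by decay ** distance,
    where distance is measured from the deepest (head) index."""
    max_id = sum(depths) + 1
    buckets = defaultdict(list)
    for name, param in named_params:
        buckets[layer_id_for(name, depths)].append(param)
    return [
        {"params": params, "lr": base_lr * decay ** (max_id - lid)}
        for lid, params in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # Strings stand in for real tensors so the sketch runs without torch.
    demo = [
        ("downsample_layers.0.0.weight", "stem_w"),
        ("stages.2.5.dwconv.weight", "mid_w"),
        ("head.weight", "head_w"),
    ]
    for group in build_param_groups(demo):
        print(group["params"], group["lr"])
```

With a real model, the list returned for `model.named_parameters()` should be usable as the parameter-group argument of an optimizer such as `torch.optim.AdamW`, with early backbone layers getting a much smaller LR than the head.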
Next, I saw you reference SparK pre-training, so I tried their pre-trained ConvNeXt-S weights. I also increased unified volume size to 180x180x5 (it's not mentioned what volume size you use for UVTR). This now gets 24.8/28.3 - NDS is still worse, but mAP is now too good...
Could I get guidance on the exact settings you used for this?
With this, training 6 epochs (I know this should be 12, but for faster experiments) with load_interval=1, no augmentations, then fine-tuning UVTR (removing the 3 conv-bn-relu 3D CNNs), I get 26.9/31.7. Compared to the 24.8/28.3 baseline, it's a +2.1/+3.4 improvement, but it's really far from the +7.7/+9.6 in the paper. Notably, NDS doesn't improve too much...
EDIT: Trained for 12 epochs, achieves just 27.2/32.1
Do you have any ideas on what I can fix or what I may be missing?
Thank you so much in advance! If you are okay with having a more in-depth discussion, please feel free to email me directly as well.
Hi, the code is released!
Thank you for releasing this amazing work! I just had a couple of questions on some of the camera-only outdoor details, @Nightmare-n.
Apologies for the list of questions, but I'm really interested in the work. Again, thank you so much in advance!