Hello and thank you for providing us with the code of your paper. I have been experimenting with the 3D case and I have some questions regarding parts of the code. Specifically:
I can see that the conditional memory bank and weighting are applied in the 2D case, but I cannot find the corresponding code in the inference mode of the 3D case. Does that mean they are not applied there?
I have noticed that in inference mode the video_length is fixed, always set to the total number of frames divided by 4. Is there a specific reason for that, or could we use up to the total number of frames?
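For context, this is how I understand the behavior I am asking about. A minimal sketch, assuming a chunked inference scheme (the function name and the `divisor=4` default are my assumptions, not the actual Medical-SAM2 API):

```python
def chunk_indices(num_frames: int, divisor: int = 4):
    """Split frame indices into contiguous windows of num_frames // divisor,
    mimicking a fixed video_length instead of one pass over all frames."""
    video_length = max(1, num_frames // divisor)  # assumed: total // 4
    return [
        list(range(start, min(start + video_length, num_frames)))
        for start in range(0, num_frames, video_length)
    ]

# e.g. a 10-frame volume -> windows of length 10 // 4 = 2
print(chunk_indices(10))  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

My question is essentially whether `divisor` could be 1 here, i.e. a single window covering the whole volume.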
I have also observed that during inference the given prompts are equally distributed according to the prompt frequency. Have you noticed any change in the performance of Medical-SAM2 in single-prompt segmentation when prompting a specific slice of the CT, e.g., the middle slice? If so, do you have any recommendations on how to choose the best slice?
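To make the single-prompt case concrete, here is a sketch of the equal spacing I mean (the function `prompt_slices` is hypothetical, written for illustration only):

```python
def prompt_slices(num_slices: int, num_prompts: int):
    """Pick num_prompts slice indices evenly spaced through the volume,
    centered in each segment; with num_prompts == 1 this is the middle slice."""
    step = num_slices / num_prompts
    return [int(step * i + step / 2) for i in range(num_prompts)]

prompt_slices(100, 1)  # -> [50], the middle slice
prompt_slices(100, 4)  # -> [12, 37, 62, 87]
```

With one prompt this default lands on the middle slice, which is why I am asking whether a different slice (e.g. the one where the structure is largest) would be a better choice.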
Lastly, in the provided code you chose to freeze the image and prompt encoders and train only the rest of the model. Is there a specific reason for that? (Did it yield better results, perhaps?)
It is common in papers that fine-tune an existing model to freeze the encoder: it is usually the heaviest component of these foundation models, and therefore also the most GPU-intensive to train.
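The usual pattern looks like this in PyTorch. A minimal sketch, where the submodule names (`image_encoder`, `prompt_encoder`, `mask_decoder`) are stand-ins, not the actual Medical-SAM2 module layout:

```python
import torch.nn as nn

# Toy model standing in for the real architecture (names are assumptions).
model = nn.ModuleDict({
    "image_encoder": nn.Linear(8, 8),
    "prompt_encoder": nn.Linear(8, 8),
    "mask_decoder": nn.Linear(8, 8),
})

# Freeze every parameter belonging to the encoders; train only the rest.
frozen_prefixes = ("image_encoder", "prompt_encoder")
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith(frozen_prefixes)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the mask_decoder parameters remain trainable
```

The optimizer is then typically built from `filter(lambda p: p.requires_grad, model.parameters())`, so the frozen encoders receive no gradient updates at all.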