Hi,
Regarding points 1 and 2, yes, it is indeed a two-stage process. Our goal is to balance specialization and collaboration. The loss function transitions from the individual $L_i$ (specialization) to $L_{\text{MoME}}$ (collaboration). This shift should be gradual; an abrupt change could cause the experts' performance to degrade. Due to space limitations, we couldn't include all our experiments. For example, freezing the experts' weights and training only the gating network in the second stage did not yield results as good as our proposed solution. Additionally, experimenting with some learning-rate warm-up settings could be worthwhile.
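For intuition, here is a minimal sketch of such a gradual transition (the linear schedule and the simple averaging of the expert losses are assumptions for illustration, not the exact MoME curriculum):

```python
import torch

def curriculum_loss(expert_losses, mome_loss, epoch, total_epochs):
    """Blend the per-expert specialization losses L_i into the joint L_MoME.
    Early epochs emphasize the L_i terms; later epochs emphasize L_MoME.
    The linear schedule below is illustrative, not the paper's exact one."""
    alpha = min(epoch / total_epochs, 1.0)      # 0 -> 1 over training
    l_spec = torch.stack(expert_losses).mean()  # average of the L_i terms
    return (1.0 - alpha) * l_spec + alpha * mome_loss
```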
For point 3, the current image patch size in nnU-Net is (128, 128, 128) with five encoder downsampling stages, resulting in a bottleneck feature shape of (4, 4, 4), i.e., 128/2^5. If we deepen the network, this patch size might no longer be suitable, as the feature map at the bottleneck layer would become too small. However, you can explore enlarging the nnU-Net by increasing the number of convolutional layers per stage and the number of feature channels.
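As a quick arithmetic check on the bottleneck size (plain Python, not nnU-Net code), each downsampling stage halves the spatial resolution, so adding even one more stage already shrinks the bottleneck drastically:

```python
patch = 128
for stages in (5, 6, 7):
    print(stages, patch // 2 ** stages)  # 5 -> 4, 6 -> 2, 7 -> 1
```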
For point 4, the seen dataset is much larger than the unseen dataset. Remember that MoME, even when tested on the unseen dataset, benefits from knowledge learned on the seen datasets (e.g., tumor knowledge from BraTS2021). In contrast, the task-specific nnU-Nets are trained solely on the unseen datasets, so they do not benefit from the knowledge gained on the seen datasets.
Thank you.
Hi there, thanks for sharing this interesting work. I have some questions regarding the detailed implementation and methodology.
In stage 1, the experts are trained. In stage 2, the experts are used to initialize MoME, and MoME is then trained with the curriculum learning strategy. If my understanding in question 1 is correct, why does the "specialisation loss $L_i$" still need to be emphasized at the beginning of training, even though each modality's expert should already be well optimized in stage 1? Maybe it is because you want the experts' weights not to change much at the beginning of training, but why not just assign a much lower learning rate to the experts' parameters? Also, does the performance degrade a lot if the two-stage training is not applied? It would be awesome if you could kindly answer these questions for me.