Hi,
Regarding points 1 and 2, yes, it is indeed a two-stage process. Our goal is to balance specialization and collaboration. The loss function transitions from the individual $L_i$ (specialization) to $L_{\text{MoME}}$ (collaboration). This shift should be gradual; an abrupt change could cause the experts' performance to degrade. Due to space limitations, we couldn't include all our experiments. For example, freezing the experts' weights and training only the gating network in the second stage did not yield results as good as our proposed solution. Additionally, experimenting with some learning-rate warm-up settings could be worthwhile.
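For intuition, here is a minimal sketch of such a gradual transition (the linear schedule and the simple averaging of the expert losses are assumptions for illustration, not the exact MoME curriculum):

```python
import torch

def curriculum_loss(expert_losses, mome_loss, epoch, total_epochs):
    """Blend the per-expert specialization losses L_i into the joint L_MoME.
    Early epochs emphasize the L_i terms; later epochs emphasize L_MoME.
    The linear schedule below is illustrative, not the paper's exact one."""
    alpha = min(epoch / total_epochs, 1.0)      # 0 -> 1 over training
    l_spec = torch.stack(expert_losses).mean()  # average of the L_i terms
    return (1.0 - alpha) * l_spec + alpha * mome_loss
```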
For point 3, the current image patch size in nnU-Net is (128, 128, 128) with five encoder downsampling stages, resulting in a bottleneck feature shape of (4, 4, 4), i.e., 128/2^5. If we deepen the network, this patch size might no longer be suitable, as the feature map at the bottleneck layer would become too small. However, you can explore enlarging the nnU-Net by increasing the number of convolutional layers per stage and the number of feature channels.
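As a quick arithmetic check on the bottleneck size (plain Python, not nnU-Net code), each downsampling stage halves the spatial resolution, so adding even one more stage already shrinks the bottleneck drastically:

```python
patch = 128
for stages in (5, 6, 7):
    print(stages, patch // 2 ** stages)  # 5 -> 4, 6 -> 2, 7 -> 1
```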
For point 4, the seen dataset is much larger than the unseen dataset. Remember that MoME, even when tested on the unseen dataset, benefits from knowledge learned on the seen datasets (e.g., tumor knowledge from BraTS2021). In contrast, the task-specific nnU-Nets are trained solely on the unseen datasets, so they do not benefit from the knowledge gained on the seen datasets.
Thank you.
Hi there, thanks for sharing this interesting work. I have some questions regarding the detailed implementation and methodology.
In stage 1, the experts are trained. In stage 2, the experts are used to initialize MoME, and MoME is then trained with the curriculum learning strategy. If my understanding in question 1 is correct, why does the "specialisation loss $L_i$" still need to be emphasized at the beginning of training, even though each modality's expert should already be well optimized in stage 1? Maybe it is because you want the experts' weights not to change much at the beginning of training, but why not just assign a much lower learning rate to the experts' parameters? Also, does the performance degrade a lot if the two-stage training is not applied? It would be awesome if you could kindly answer these questions for me.