Thank you for your interest. First, note that input repetition is never used in MixMo: it generally decreases diversity across subnetworks. That said, this implementation difference may also have some impact on batch repetition. Intuitively, I speculate that applying different pixel augmentations to the different duplicates of the same image may increase diversity across subnetworks. This could partially explain why our MIMO baseline is stronger than the Google implementation. However, this gain probably only holds as long as the augmentation process is not too strong/destructive; otherwise, we would lose the benefit of batch repetition/input repetition. Please let me know if you give it a try. Finally, this discussion is closely related to the variance reduction discussion in this inspiring paper: https://arxiv.org/pdf/1901.09335.pdf
Thanks for your reply, it's really very helpful. Indeed, differing data augmentation (DA) has a similar effect to Batch Augmentation (thanks for sharing), and is beneficial for performance. You just reminded me that input repetition is not used in MixMo, but is used in MIMO. Since the main idea of MixMo is to enhance model diversity (through the mixing block), it's reasonable to disable input repetition. According to my understanding, both lower input repetition and differing data augmentation (DA) are capable of encouraging model diversity. I have actually run some experiments on my own dataset. If differing DA is used, the best setting of input repetition in MIMO is 0.9, which is much higher than suggested in the paper, and the achieved result is better than a single model. I haven't tried it on MixMo yet. I'll update you with any interesting results. Thanks!
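For reference, here is a minimal NumPy sketch of what I mean mechanically by an input repetition setting like 0.9: a fraction of batch positions feed the same example to both subnetworks, and the rest are re-shuffled independently. The helper name `make_mimo_indices` and its signature are made up for illustration, not taken from either codebase.

```python
import numpy as np

def make_mimo_indices(batch_size, input_repetition=0.9, rng=None):
    # Hypothetical helper (not from either repo): builds index pairs for a
    # 2-subnetwork MIMO batch. A fraction `input_repetition` of positions
    # shows the same example to both subnetworks; the remaining positions
    # are re-shuffled independently for the second subnetwork.
    rng = rng or np.random.default_rng()
    idx_a = rng.permutation(batch_size)
    idx_b = idx_a.copy()
    n_independent = int(round((1.0 - input_repetition) * batch_size))
    if n_independent > 0:
        positions = rng.choice(batch_size, size=n_independent, replace=False)
        idx_b[positions] = rng.permutation(idx_b[positions])
    return idx_a, idx_b

idx_a, idx_b = make_mimo_indices(batch_size=128, input_repetition=0.9)
print((idx_a == idx_b).mean())  # roughly 0.9 (a few shuffled positions may coincide by chance)
```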
Thanks for the great work! I'm currently trying to adopt MixMo for my own projects, and I found some modifications that differ from Google's MIMO work. For input batch generation, suppose we use input repetition = 1.0; technically the indexes and images for the 2 inputs of the 2 experts should be exactly the same. In Google's implementation, they read the images first, and then construct the 2 inputs for the experts based on the input repetition value (shuffling a fraction of the indexes and keeping the rest unchanged). In your implementation, you compute the indexes first (based on input repetition), and then read the images accordingly. The problem is that, if we use default data augmentation (e.g. random cropping or flipping, which is also used in your code), even when the indexes are the same, the images are not necessarily the same, because the DA is applied to these images randomly! In Google's implementation, since the images are read in first, this issue does not exist. I hope I've made my point clear. I'm wondering how this would affect the performance, and which implementation is correct or more reasonable? I'd like to hear your opinions. Appreciate your prompt reply!
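To make the difference concrete, here is a rough sketch of the two orderings. The toy `augment` function just stands in for a random crop/flip; this is illustrative only and not code from either repository.

```python
import numpy as np

def augment(image, rng):
    # Toy stand-in for a random crop/flip: the result differs between calls.
    return np.flip(image, axis=1) if rng.random() < 0.5 else image

rng = np.random.default_rng(0)
images = [np.arange(12).reshape(3, 4) for _ in range(8)]
idx_a = idx_b = np.arange(8)  # input repetition = 1.0: identical indexes

# (a) Index-first (as described for this repo): each subnetwork loads and
# augments its images separately, so identical indexes can still give
# different pixels.
batch_a = [augment(images[i], rng) for i in idx_a]
batch_b = [augment(images[i], rng) for i in idx_b]

# (b) Image-first (as described for the Google MIMO code): each image is
# augmented once, then both subnetwork inputs reuse the same tensors.
augmented = [augment(img, rng) for img in images]
batch_a2 = [augmented[i] for i in idx_a]
batch_b2 = [augmented[i] for i in idx_b]

print(all(np.array_equal(x, y) for x, y in zip(batch_a, batch_b)))    # usually False
print(all(np.array_equal(x, y) for x, y in zip(batch_a2, batch_b2)))  # True
```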