Cecile-hi / Multimodal-Learning-with-Alternating-Unimodal-Adaptation

Multimodal Learning Method MLA for CVPR 2024

Questions about training on Food101 #9

Open Jiayu-Xiong opened 3 weeks ago

Jiayu-Xiong commented 3 weeks ago

I have observed that the command you provided for launching the training with the Food101 dataset is:

--lorb m3ae --modulation Normal --dataset Food101

However, in the class structure of M3AE (within the forward method), I noticed the following:

a = self.audio_fc(a)
v = self.visual_fc(v)

At the initialization stage, there is:

self.mae_a = M3AE(text_vocab_size = 30522, config_updates = model_config)
self.mae_v = M3AE(text_vocab_size = 30522, config_updates = model_config)
m3ae_ckpt_audio = "/path/to/m3ae_base_audio.pth"
m3ae_ckpt_visual = "/path/to/m3ae_base_visual.pth"

It seems that a pretrained model needs to be loaded. However, the Food101 dataset does not actually contain audio; it contains text instead. Meanwhile, args.dataset in this class appears to point to Food101. I am not entirely sure, but I suspect this might be a typo and that the intended model is actually CLIP.

In the case of the CLIP model, the dataset loader does not appear to read the dataset's original index files directly, but rather a specially pre-processed feature file:

class CLIPDataset(Dataset):
    def __init__(self, args, mode='train'):
        ...
        if args.dataset == "Food101":
            self.data_root = '/data1/zhangxiaohui/food101/clip_feature/'
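For reference, my rough guess at how such a 'clip_feature' file could be produced is sketched below. This is only my assumption about the pre-processing, not code from your repository; the backbone choice, the image/text pairing, and the saved file layout are all guesses on my part:

import clip                      # openai/CLIP package
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # backbone is an assumption

def extract_pair(image_path, caption):
    # Encode one (image, text) sample into fixed-size CLIP features.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image).squeeze(0).cpu()
        txt_feat = model.encode_text(text).squeeze(0).cpu()
    return img_feat, txt_feat

# e.g. torch.save({'visual': img_feat, 'text': txt_feat}, out_path)  # stored layout is assumed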

Additionally, the presence of files whose names contain the 'my' keyword makes me uncertain whether your code can work directly with the dataset in its original structure. My understanding may be limited, and I may not have captured every detail of your carefully designed approach, so I would like to kindly confirm: are your experimental results all based on a pretrained model?

Mannix-D commented 3 weeks ago

Bro, I suspect the author only conducted experiments on one dataset, CREMA-D.

Jiayu-Xiong commented 3 weeks ago

I'm not sure. Until I can train it successfully myself, I will footnote the numbers as quoted from the original paper to avoid controversy. In fact, an ImageNet-1K pre-trained ResNet-152 together with a pre-trained BERT-Base backbone achieves 92.67% accuracy with a simple weighted average. I have experimented with the QMF code and it is credible, provided you accept the authors' (1, 3) pooling.
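To be concrete, by "simple weighted average" I mean plain late fusion of the two unimodal predictions, roughly as sketched below. This is my own illustration rather than code from either repository, and the 0.5/0.5 weights are just placeholders:

import torch

def weighted_average_fusion(visual_logits, text_logits, w_visual=0.5, w_text=0.5):
    # Late fusion: combine per-modality class logits with fixed scalar weights.
    return w_visual * visual_logits + w_text * text_logits

# usage with two unimodal heads (illustrative names):
# logits = weighted_average_fusion(resnet_head(image_feat), bert_head(text_feat))
# prediction = logits.argmax(dim=-1)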

If the author's M3AE is actually M3E, then such a large-scale pre-trained and feature-aligned backbone should reach more than 93%. So when MLA is actually run, its performance should in theory be slightly higher than the results reported in the article.

Mannix-D commented 3 weeks ago

QMF is absolutely reproducible; I have reproduced it and used it in my own work. The code related to IEMOCAP, however, is hard to run, the model-loading code is of the wrong type, and the "m3ae.pth" file can't be found anywhere on the web.

Jiayu-Xiong commented 3 weeks ago

It's not a question of whether it's true; it simply isn't verifiable, so save yourself some time.

Of course, the authors don't have to prove this, nor can they.

Mannix-D commented 3 weeks ago

The question is "there is no such file named **.pkl' model, but they used it in their codes. I also have email them, but there is no response,

Bro, I suspect the author conducted experiments on only one dataset CREMA-D.

I'm not sure, but before I can successfully train, I will footnote the data quoting the original text to avoid controversy. In fact, ImageNet 1K's pre-trained ResNet 152, along with pre-trained BERT Base as the backbone, achieves 92.67% accuracy with a simple weighted average. I have experimented with the QMF code and it is credible, provided you agree with the author's (1, 3) pooling. If the author's M3AE is actually M3E, I think backbone should perform more than 93% on such a large-scale pre-trained and feature-aligned backbone. So actually running MLA, its performance theory should be slightly higher than the results in the article.

The QMF is absolutely conductible. I have reproduced the QMF and used it in my work. The code related to IEMOCAP is hard to conduct, and the loading model codes are of the wrong type. The "m3ae.pth" file can't be found on the web.

It's not about whether it's true or not, it's not verifiable, just save yourself some time.

Of course, the authors don't have to prove this, nor can they.

The question is: there is no such '**.pkl' model file, yet they use it in their code. I have also emailed them, but there has been no response. Maybe they trained it themselves.

hubaak commented 2 weeks ago

Yeah, and even for CREMA-D they use different (epoch, batch size, lr) settings for MLA and for the other methods. I don't think that is a fair comparison, since the final result depends heavily on these parameters for every method. I tried MLA on CREMA-D, VGGSound, AVE, and UCF101, and guess what: MLA is not superior to UME (Uni-modal Ensemble) on any of these datasets under the same (epoch, batch size, lr).

Jiayu-Xiong commented 2 weeks ago

I can't verify what you're saying, and I haven't run experiments beyond the Food-101 dataset, but I think it's plausible that you would arrive at those numbers. A small piece of my current work is unified representation analysis, and I have found that neuron splitting occurs at the dimensions reported in the article, so the effective representation dimension of each modality is usually less than half of the nominal one. You can refer to the UAVM report (https://github.com/YuanGongND/uavm/commits?author=YuanGongND).

I ran a comparative experiment with the same backbone as QMF and got roughly 85. By comparison, concatenation is around 86 and a fixed 50/50 weighting is around 92.

hubaak commented 1 week ago

I agree that the head may split into two parts for audio and visual, with less than 50% of the dimensions working well for each modality. I think that's why MLA had almost the same performance as UME in my experiments.

By the way, I think you could try some audio-visual datasets. The text encoder is too powerful on Food101 (I got 86.7% with pre-trained BERT), so there is not much room for the visual encoder to cooperate with the text encoder.

Jiayu-Xiong commented 1 week ago

I'm glad to have your agreement; it strengthens my determination to finish this work. Food-101 has fed too many papers, and it's a classic example of what I want to illustrate. My paper is not about chasing benchmark numbers but about analyzing the structural characteristics of a typical network; I will cite this work and analyze its behavior, so I do not intend to change my evaluation plan. Your results are basically consistent with those of QMF, and my own run gives 86.71%.

hubaak commented 1 week ago

Looking forward to seeing your work in the future! Currently most methods are based on concatenation, yet concatenation effectively utilizes only as many dimensions as there are classes. I believe a theoretical analysis of the encoders' output features will help us understand the underlying mechanisms of multimodal fusion.
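To spell out what I mean by concatenation fusion, here is a minimal sketch (my own illustration, not code from this repository): a single linear head over the concatenated unimodal features.

import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    def __init__(self, dim_visual, dim_text, num_classes):
        super().__init__()
        # A single linear layer maps the concatenated feature to class logits.
        self.classifier = nn.Linear(dim_visual + dim_text, num_classes)

    def forward(self, feat_visual, feat_text):
        # Concatenate the two unimodal features and classify jointly.
        return self.classifier(torch.cat([feat_visual, feat_text], dim=-1))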

Jiayu-Xiong commented 1 week ago

Thank you for the encouragement; I greatly admire your rigorous spirit. I believe carefully designed experiments like yours will carry the field forward until interpretability itself matures.

CXianRen commented 1 week ago

Hi guys, not sure if you've found the M3AE pre-trained model. Here I share one that might work: m3ae_public, where they provide the pre-trained model and a method for converting it to PyTorch format.
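In case the conversion script is missing for you as well, the rough idea is just to flatten the JAX parameter tree and re-save it as a PyTorch state dict, along the lines of the sketch below. I'm assuming the checkpoint is a pickled nested dict of arrays; the flattened key names still have to be mapped by hand onto this repo's M3AE module, and Flax dense kernels are stored transposed relative to torch.nn.Linear weights:

import pickle
import numpy as np
import torch

def flatten_params(tree, prefix=""):
    # Recursively flatten a nested dict of arrays into {"a.b.c": ndarray}.
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = np.asarray(value)
    return flat

with open("m3ae_base.pkl", "rb") as f:   # checkpoint path and format are assumptions
    params = pickle.load(f)

state_dict = {k: torch.from_numpy(v.copy()) for k, v in flatten_params(params).items()}
torch.save(state_dict, "m3ae_base.pth")  # keys still need renaming to match the repo's M3AE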

Mannix-D commented 1 week ago

Have you loaded this model successfully? The model provided on that site is difficult to load, the conversion code doesn't exist, and the paper's code can't load m3ae.pkl either.

CXianRen commented 1 week ago

Aha, I feel the same. I'm still working on it; most of the code provided online is no longer available, which is really annoying. I'll share it if I get it to work.