SMARTlab-Purdue / Husformer

This repository contains the source code for our paper: "Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition". For more details, please refer to our paper at https://arxiv.org/abs/2209.15182.

Where do the fused modalities go? #2

Closed suprola1017 closed 1 year ago

suprola1017 commented 1 year ago

After I run the code, only the model parameters are printed out; the fused modalities are not. Does anyone know where the fused modalities go?

itsmohitanand commented 1 year ago

@suprola1017 Can you be more specific about what you want to print out? Maybe referring to the image and specifying exactly what you want would help more.

zdz0086 commented 1 year ago

@suprola1017 Sorry, could you express your meaning more clearly?

suprola1017 commented 1 year ago

When I ran the code provided with the paper, the output did not include the focus of the paper: the fused modalities. As I understand it, the code should extract all of the modalities from the dataset and use Husformer to perform multi-modal fusion, but no fused result is output after training; only the model parameters are printed. So I would like to ask: where is the fused data?

zdz0086 commented 1 year ago

@suprola1017 Yes, I understand your question. You want to obtain multimodal fusion images from the network, but this functionality is not directly available in this repo. You can achieve this by creating a Python program to output the images. Here is a general guideline on how you can approach this:

1. Navigate to the 'module' folder and inspect the code.
2. Locate the 'TransformerEncoder' class in the relevant file (it may be in a file like transformer.py or models.py, depending on the codebase you're working with).
3. Examine the forward function within the 'TransformerEncoder' class. This function takes the inputs from the multimodal layers and fuses them to generate an output.
4. Modify the forward function to return the fused output as an image. You can use libraries like NumPy, OpenCV, or PIL to handle image data in Python.
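For example, a PyTorch forward hook lets you capture a module's output without editing its forward function; the minimal sketch below uses a stand-in module, since the exact class and file layout in this repo may differ:

```python
import torch
import torch.nn as nn

# Stand-in for the repo's TransformerEncoder; only the hook pattern matters here.
class DummyEncoder(nn.Module):
    def forward(self, x):
        return x * 2  # placeholder for the real fusion computation

captured = {}

def save_output(module, inputs, output):
    # Runs on every forward pass; detach so the tensor can be inspected later.
    captured["fused"] = output.detach().cpu()

encoder = DummyEncoder()
handle = encoder.register_forward_hook(save_output)

_ = encoder(torch.randn(4, 10, 30))   # e.g. (batch, seq_len, feature_dim)
print(captured["fused"].shape)        # the hooked output is now available here

handle.remove()  # remove the hook when finished
```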

suprola1017 commented 1 year ago

Sorry, I didn't make it clear. I'm not trying to do image fusion. I just want to find out where the multimodal data that this code is supposed to fuse ends up, e.g., the fused data of the modalities in the DEAP dataset. I think the last_hs variable in models.py stores the fused data, but I'm not sure whether I'm right.

zdz0086 commented 1 year ago

What we mean by fused data is illustrated in Fig. 1 of our paper, and the code in models.py follows the same process as Fig. 1; specifically, we labeled the self-attention and cross-modal attention steps. The final fused data is 'last_hs2' in models.py.
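If you only want to inspect it, one simple option (a rough sketch; the file name and saving location here are arbitrary, and saving inside forward will overwrite the file on every batch) is to dump last_hs2 to disk and reload it after evaluation:

```python
# Inside the model's forward() in models.py, right after last_hs2 is computed,
# one could add a line like:
#
#     torch.save(last_hs2.detach().cpu(), "fused_last_hs2.pt")
#
# Afterwards the fused features can be reloaded and inspected:
import torch

fused = torch.load("fused_last_hs2.pt")
print(fused.shape)   # e.g. (batch_size, combined_dim)
print(fused[0])      # fused feature vector of the first sample in the saved batch
```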

suprola1017 commented 1 year ago

Thank you very much for your help! I have one last question: there is a variable combined_dim=30 in models.py that seems to determine the last dimension of the final fused data. What does combined_dim mean, what does the value 30 represent, and can I change it?

zdz0086 commented 1 year ago

Yes, this is a tunable variable and you can change it however you want, but 30 was the best setting in our experiments.

suprola1017 commented 1 year ago

I apologize, but I still don't quite understand what the variable combined_dim=30 represents. I saw in the paper that the Transformer Hidden Unit Size should be set to 40. Could you please tell me which variable in the code corresponds to the Transformer Hidden Unit Size, so I can understand this better?

R7-Robot commented 1 year ago

Hi, the parameter 'combined_dim' is not related to any transformer layer. Instead, it refers to the shared dimensionality of the features from each modality after passing through the temporal convolution layer. Please refer to Section III-B in our paper.
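As an illustration (the layer construction, modality names, and raw feature sizes below are placeholders, not the exact code in this repo), each modality is projected to the same combined_dim by a 1-D temporal convolution before the cross-modal transformers:

```python
import torch
import torch.nn as nn

combined_dim = 30  # shared feature size after the temporal convolution, as discussed above

# Hypothetical raw feature sizes for three modalities.
raw_dims = {"modality_1": 32, "modality_2": 8, "modality_3": 4}

# One Conv1d per modality maps its raw dimension to combined_dim.
projections = nn.ModuleDict({
    name: nn.Conv1d(in_channels=dim, out_channels=combined_dim, kernel_size=1, bias=False)
    for name, dim in raw_dims.items()
})

batch, seq_len = 16, 128
for name, dim in raw_dims.items():
    x = torch.randn(batch, dim, seq_len)   # (batch, channels, time)
    projected = projections[name](x)       # (batch, combined_dim, time)
    print(name, projected.shape)           # all modalities now share combined_dim
```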

suprola1017 commented 1 year ago

Sorry, I would like to ask for your help again. I have been debugging my code for some time, but I still cannot reach 90% accuracy on the Pre-DEAP dataset. Is it due to my hyperparameter settings? I strictly followed the specifications in the paper for the hyperparameters I could find, but I did not come across the settings for the focal loss. Could this be the reason for my accuracy issue?

R7-Robot commented 1 year ago

Sorry to hear that. Have you used 'Pre-DEAP.py' in 'make_data' to prepare the .pkl files? In our experimental setting, each input sample is a 2-D feature matrix extracted from a 1-second segment, and we shuffle all samples to conduct cross-validation. The performance may vary with the experimental setting and the cross-validation strategy you use.
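A minimal sketch of that kind of shuffled cross-validation (the sample count, feature shapes, and fold count below are placeholders, not our exact configuration):

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data: N samples, each a 2-D feature matrix from a 1-second segment.
X = np.random.randn(1000, 8, 128)
y = np.random.randint(0, 3, size=1000)

# Shuffle all samples, then split them into folds for cross-validation.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kfold.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... train a model on (X_train, y_train), evaluate on (X_test, y_test) ...
    accuracies.append(0.0)  # placeholder for this fold's test accuracy

print("mean accuracy:", float(np.mean(accuracies)))
```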

suprola1017 commented 1 year ago

"Focal Loss"

Yes, I followed the instructions in 'make_data' and used 'Pre-DEAP.py' to prepare the .pkl files. However, despite strictly adhering to the procedure and making no modifications, I could not achieve the performance reported in the paper. I'm wondering if the issue could be related to the hyperparameter settings; specifically, I couldn't find the parameters used for the focal loss. Could you please provide some guidance or clarification on this? Thank you for your assistance.
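For reference, the usual focal loss formulation has two hyperparameters, a focusing parameter gamma and optional per-class weights alpha; a minimal sketch (this is generic and may differ from the implementation used in this repo):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss; gamma and alpha are the hyperparameters in question."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Pick out the (log-)probability of the target class for each sample.
    target_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    target_probs = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    loss = -((1.0 - target_probs) ** gamma) * target_log_probs
    if alpha is not None:                    # optional per-class weights
        loss = loss * alpha.gather(0, targets)
    return loss.mean()

logits = torch.randn(8, 3)                   # 8 samples, 3 classes
targets = torch.randint(0, 3, (8,))
print(focal_loss(logits, targets, gamma=2.0))
```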

cucutetekekeaiai commented 1 year ago

I have the same problem: I strictly followed the source code of the article, but the experimental results did not meet expectations. For the cross-validation in your experiment, did you use the 10 .pkl files produced by Pre-DEAP.py as data, train 10 models separately, and test each on its own test set to obtain an average accuracy? During reproduction, there is a significant gap between training and validation accuracy for each .pkl file. May I know which parameters I need to adjust?

zdz0086 commented 1 year ago

@suprola1017 and @cucutetekekeaiai I apologize for any misunderstanding. Please note that the code available on GitHub has not been updated with our latest experimental configurations; this update will occur once our paper is officially published. In the interim, I am pleased to offer some suggestions: consider switching the optimizer to Stochastic Gradient Descent (SGD) and increasing the number of training iterations, preferably to within the range of 120-180. Implementing these adjustments should help you achieve results closer to those reported in our study.
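For example, a minimal sketch of those two adjustments (the stand-in model, learning rate, and momentum below are placeholders, not our final settings):

```python
import torch
import torch.nn as nn

model = nn.Linear(30, 3)   # stand-in for the Husformer model

# Switch the optimizer to SGD, as suggested above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

num_epochs = 150           # within the suggested 120-180 range
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 30)).sum()  # placeholder forward pass and loss
    loss.backward()
    optimizer.step()
```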

suprola1017 commented 1 year ago

Thank you for your understanding. I appreciate your suggestions and taking the time to provide guidance. I look forward to the publication of your paper and the updated code. Thank you again for your assistance.

R7-Robot commented 1 year ago

You could try again with the recommended parameters we updated.