ZBigFish / FER-VT

An unofficial implementation of the paper "Facial expression recognition with grid-wise attention and visual transformer"

Any other tips for improving the results? #2


Runa5151 commented 1 year ago

Thank you for sharing this excellent project. I've tried your code many times but, unfortunately, I could not get good results. I used several facial expression databases for training (excluding the RAF dataset). These 8-class (majority voting) expression images (not aligned) were augmented with the 'augment-2' method you provide in generate_training_data.py. However, the best accuracy was 83% on the validation set in train.py.

Even following the code you suggested, apart from reaching 80% when using only FERPlus, all of my attempts to mix and augment various facial expression datasets failed to exceed 80% accuracy. So may I ask: would you provide a pre-trained model? If not, would you share some tips for improving the accuracy?
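For context, here is a minimal sketch of the kind of majority-vote label derivation I mean. This is only an illustration, not code from this repo; the FERPlus-style CSV column names below are assumptions.

```python
# Illustrative sketch: derive 8-class majority-vote labels from
# FERPlus-style per-annotator vote counts. The column names and CSV
# layout are assumptions, not this repo's actual data format.
import csv

EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]

def majority_vote_label(votes):
    """Return the index of the emotion with the most votes, or None when
    no emotion received any votes (e.g. images tagged only 'unknown')."""
    best = max(range(len(votes)), key=lambda i: votes[i])
    return best if votes[best] > 0 else None

def load_labels(csv_path):
    labels = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            votes = [int(row[e]) for e in EMOTIONS]
            label = majority_vote_label(votes)
            if label is not None:  # drop images with no usable votes
                labels[row["Image name"]] = label
    return labels
```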

Thank you. Have a good day.

ZBigFish commented 1 year ago

In fact, I have reason to suspect that this paper contains exaggerated, falsified, or even fabricated results.

The low-level feature extraction method described in the article actually degraded model performance when I transplanted it to other baseline networks.

As for high-level feature extraction, the article proposes splicing feature maps of different scales into a single Transformer encoder input, which in practice produces inconsistencies in channel count and spatial size across the maps. Following the method given in the article, the feature maps can be reshaped (upsampling plus a convolution layer) so that they share the same shape and can be spliced, but this is logically at odds with the multi-head self-attention mechanism the Transformer architecture relies on. As you know, MHSA is designed to explore the interdependencies between different parts within one image, not comparisons between images. The method in the article amounts to using a Transformer to learn the connections between several highly similar pictures, since the spliced inputs are feature maps of the same picture at different depths (see the sketch at the end of this comment). So, I maintain strong doubts about the veracity of this paper.

Moreover, I ran a variety of experiments on the published code, and the results fall far short of those announced in the article. After a simple analysis, I believe the roughly 80% accuracy this project reaches has little to do with the architecture described in the article; it is simply that the feature pyramid network and the incomplete Transformer architecture together form a sufficiently deep model.

I made what I consider reasonable modifications to the model, eventually surpassed the performance reported in the article, and reached a SOTA score, though the result looks completely different. The general idea is to feed the multi-scale feature maps from the FPN into separate Transformers, then perform feature fusion and classification; in the low-level feature processing stage, we propose a new scheme to strengthen the learning of primary features. Our paper is under submission, and once it is accepted I will release the code to a new repository on this account. Feel free to refer to our work when the time comes. Have a nice day!
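As promised above, here is a minimal PyTorch sketch of the splicing design I am criticizing, based on my reading of the paper's description rather than this repo's code. All channel counts, the shared spatial size, and the encoder hyperparameters are illustrative assumptions.

```python
# Sketch of the criticized design: multi-scale FPN feature maps are
# projected and upsampled to a common shape, then concatenated into one
# token sequence for a single Transformer encoder. Shapes and
# hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplicedScaleEncoder(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), dim=256, size=14):
        super().__init__()
        self.size = size
        # 1x1 convs project each scale to the same channel count.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats):
        tokens = []
        for f, proj in zip(feats, self.proj):
            f = proj(f)
            # Upsample every scale to a shared spatial size so that maps
            # of different resolutions can be spliced together.
            f = F.interpolate(f, size=(self.size, self.size),
                              mode="bilinear", align_corners=False)
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, H*W, dim)
        # Concatenating along the token axis makes self-attention compare
        # tokens drawn from near-identical views of the same image --
        # the mismatch with MHSA's purpose described above.
        x = torch.cat(tokens, dim=1)
        return self.encoder(x)

# Example with made-up FPN shapes:
# feats = [torch.randn(2, 256, 56, 56), torch.randn(2, 512, 28, 28),
#          torch.randn(2, 1024, 14, 14)]
# out = SplicedScaleEncoder()(feats)  # (2, 3*14*14, 256)
```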

Runa5151 commented 1 year ago

Thank you very much for your detailed and sharp comments on the article. Honestly, this article is not easy for me, so please understand that I can't offer more opinions even though you explained it clearly; still, I could entirely understand what you wanted to say. I'm excited to hear that you've conducted research that overcomes the weaknesses you identified. I sincerely hope your work is accepted by a top conference or journal, and I look forward to seeing it in a new repository soon. Good luck and have a nice day!

ramsai0206 commented 1 year ago

Which torch package did you use? I tried with the one given, but I was not able to get it working.