CVI-SZU / ME-GraphAU

[IJCAI 2022] Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition, PyTorch code
MIT License

Where can I get the test dataset? #4

Closed noirmist closed 2 years ago

noirmist commented 2 years ago

To my knowledge, cross-validation needs a separate test dataset.

Here is the cross-validation workflow: [cross_validation diagram]

From your paper, I cannot find any information on how to prepare the test dataset. Could you explain the details?

And when you preprocess the face data with MTCNN, did you align the faces? If so, could you also explain the alignment process?

Thank you!

lingjivoo commented 2 years ago

Hi. We conduct a subject-independent three-fold cross-validation for each dataset and report the average results over the 3 folds, which follows the same protocol as previous studies [1,2,3]. So there is no independent test set. We use MTCNN to perform face detection and alignment for each frame, crop it to 256 × 256, and further crop it to 224 × 224 as the input to the backbones. Indeed, we align the faces using the landmarks provided by the MTCNN tool to make sure the crop is filled by the face; you can also use other tools such as dlib or RetinaFace. You can see a preprocessed image sample in our data folder. (A rough preprocessing sketch is given after the references below.)

[1] Zhao K, Chu W S, Zhang H. Deep region and multi-label learning for facial action unit detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3391-3399.

[2] Song T, Chen L, Zheng W, et al. Uncertain graph neural networks for facial action unit detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(7): 5993-6001.

[3] Shao Z, Liu Z, Cai J, et al. JAA-Net: Joint facial action unit detection and face alignment via adaptive attention[J]. International Journal of Computer Vision, 2021, 129(2): 321-340.
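For anyone reproducing this, here is a rough sketch of the preprocessing described above, assuming the facenet-pytorch MTCNN wrapper (the exact MTCNN tool, crop margin, and the helper name align_and_crop are illustrative choices, not the authors' actual pipeline):

    # Rough preprocessing sketch (assumption: facenet-pytorch MTCNN; the exact tool
    # and crop margins used by the authors may differ): detect the face and its five
    # landmarks, rotate so the eyes are horizontal, crop to 256x256, then 224x224.
    import cv2
    import numpy as np
    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(select_largest=True)

    def align_and_crop(img_path, out_size=256, final_size=224, margin=0.2):
        img = Image.open(img_path).convert('RGB')
        boxes, probs, points = mtcnn.detect(img, landmarks=True)
        if boxes is None:
            return None  # no face detected
        box, lmk = boxes[0], points[0]  # lmk: [left_eye, right_eye, nose, mouth_l, mouth_r]
        le, re = lmk[0], lmk[1]
        # Rotate around the eye midpoint so the eye line becomes horizontal
        angle = float(np.degrees(np.arctan2(re[1] - le[1], re[0] - le[0])))
        center = (float((le[0] + re[0]) / 2), float((le[1] + re[1]) / 2))
        rot = cv2.getRotationMatrix2D(center, angle, 1.0)
        aligned = cv2.warpAffine(np.array(img), rot, (img.width, img.height))
        # Reuse the detection box (small rotations only), padded so the crop keeps the whole face
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        x1, y1 = max(int(x1 - margin * w), 0), max(int(y1 - margin * h), 0)
        x2 = min(int(x2 + margin * w), aligned.shape[1])
        y2 = min(int(y2 + margin * h), aligned.shape[0])
        face = cv2.resize(aligned[y1:y2, x1:x2], (out_size, out_size))
        # Center-crop to the backbone input size (training code may use a random crop instead)
        off = (out_size - final_size) // 2
        return face[off:off + final_size, off:off + final_size]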

noirmist commented 2 years ago

Hi, when I train your model on the DISFA dataset, I get the best validation F1 score at the second epoch in both stage 1 and stage 2, so I'm curious how you choose the best weights. In your paper, you train the model for 20 epochs. Did you then pick the weights with the best F1 score, or use the weights from the final (20th) epoch?

lingjivoo commented 2 years ago

We use an early-stop strategy to choose the weights. More specifically, you can select the best validation result on one fold, record the epoch at which it occurs, and stop training on the other folds at that epoch.
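In case it helps others, a minimal sketch of that selection scheme (not the authors' actual code; build_model, train_one_epoch, evaluate_f1, and save_checkpoint are hypothetical stand-ins for the repo's training and evaluation routines):

    # Sketch of the early-stop selection described above: pick the best-F1 epoch on the
    # first fold, then train the remaining folds for exactly that many epochs.
    def cross_validate(folds, build_model, train_one_epoch, evaluate_f1,
                       save_checkpoint, max_epochs=20):
        # Fold 1: run the full schedule and record the epoch with the best validation F1.
        model = build_model()
        best_f1, best_epoch = -1.0, 0
        for epoch in range(1, max_epochs + 1):
            train_one_epoch(model, folds[0].train)
            f1 = evaluate_f1(model, folds[0].val)
            if f1 > best_f1:
                best_f1, best_epoch = f1, epoch
                save_checkpoint(model, fold=0, epoch=epoch)

        # Remaining folds: stop at the epoch selected on fold 1.
        fold_f1 = [best_f1]
        for i, fold in enumerate(folds[1:], start=1):
            model = build_model()
            for epoch in range(1, best_epoch + 1):
                train_one_epoch(model, fold.train)
            fold_f1.append(evaluate_f1(model, fold.val))
            save_checkpoint(model, fold=i, epoch=best_epoch)
        return sum(fold_f1) / len(fold_f1)  # average F1 over the folds, as reported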

Supltz commented 2 years ago

Hi, I had a look into your code. Could you specify where you test your model? According to the data-splitting methods you cited, there should be an independent test set for each fold. Please have a look at these repositories: AUnets, JAA-Net. From your code, it seems that you use the test fold for validation; however, the validation set should be split from the training fold only (a sketch of such a split is included at the end of this comment). It says here in AUnets:

"Based on DRML paper, we use their exact subject-exclusive three fold testing (These subjects are exclusively for testing on each fold, the remaining subjects are for train/val):"

I am sorry if I got it wrong, but I would appreciate it if you could clarify this.
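For reference, here is what a subject-exclusive three-fold split with a held-out validation subset could look like, using scikit-learn's GroupKFold and GroupShuffleSplit; the subject IDs and the helper name subject_exclusive_splits are illustrative, not the lists used by any of these repos:

    # Sketch of a subject-exclusive three-fold protocol with a separate validation set:
    # each fold's subjects are used for testing only, and the validation subjects are
    # drawn from the remaining (training) subjects.
    import numpy as np
    from sklearn.model_selection import GroupKFold, GroupShuffleSplit

    def subject_exclusive_splits(frame_paths, subject_ids, n_folds=3, val_ratio=0.2, seed=0):
        frame_paths = np.asarray(frame_paths)
        subject_ids = np.asarray(subject_ids)
        gkf = GroupKFold(n_splits=n_folds)
        for trainval_idx, test_idx in gkf.split(frame_paths, groups=subject_ids):
            # Carve the validation subset out of the training subjects only,
            # so no test subject ever influences model selection.
            gss = GroupShuffleSplit(n_splits=1, test_size=val_ratio, random_state=seed)
            tr, va = next(gss.split(frame_paths[trainval_idx],
                                    groups=subject_ids[trainval_idx]))
            yield trainval_idx[tr], trainval_idx[va], test_idx

Under the looser protocol discussed in this thread, the test_idx frames double as the validation set, which is exactly the point of contention.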

lingjivoo commented 2 years ago

I have to admit there are some problems with the current protocols [1, 2, 3, 4]. You can see this in the JAA-Net code. In train_JAAv1.py, Line 39:

    dsets['test'] = ImageList(crop_size=config.crop_size, path=config.test_path_prefix, phase='test',
                              transform=prep.image_test(crop_size=config.crop_size),
                              target_transform=prep.land_transform(img_size=config.crop_size,
                                                                   flip_reflect=np.loadtxt(config.flip_reflect)))

and Line 155:

    f1score_arr, acc_arr, mean_error, failure_rate = AU_detection_evalv1(
        dset_loaders['test'], region_learning, align_net, local_attention_refine,
        local_au_net, global_au_feat, au_net, use_gpu=use_gpu)

And in test_JAAv1.py, Line 21, the exact same ImageList is constructed with path=config.test_path_prefix and phase='test' for the final evaluation. So the list used for model selection during training is the very list that is reported on, and we can conclude that there is no truly independent validation set in this protocol. You can also see this in the DRML code. In main.py, Lines 32-35:

    train_sample_nb = len(dataset.train_dataset)
    test_sample_nb = len(dataset.test_dataset)
    train_batch_nb = len(dataset.train_loader)
    test_batch_nb = len(dataset.test_loader)

Today, many works [1, 2, 3] follow JAA-Net's protocol because it released its train and test lists, which makes it convenient to compare methods. Indeed, the absence of an independent test set is a serious problem. And, even worse, in JAA-Net's code, they deleted some bad cases from the test lists, which inflates the evaluation performance. In our work, we still use this subject-independent three-fold cross-validation for comparison, but we restore the missing frames in the test and train lists.

[1] Song T, Cui Z, Zheng W, et al. Hybrid message passing with performance-driven structures for facial action unit detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 6267-6276.

[2] Song T, Chen L, Zheng W, et al. Uncertain graph neural networks for facial action unit detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(7): 5993-6001.

[3] Jacob G M, Stenger B. Facial action unit detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7680-7689.

[4] Li G, Zhu X, Zeng Y, et al. Semantic relationships guided representation learning for facial action unit recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 8594-8601.

Supltz commented 2 years ago

First, I would like to thank you for your reply and the work of releasing the code.

Most current work follows the protocol of the DRML paper. Actually, the first author of DRML was my supervisor during my undergraduate studies, and she messaged me that each fold is split into train/val/test, but since that paper was published six years ago, she might not be sure about the details. In any case, it seems there are now two kinds of data-splitting methods in academia, based on different understandings of DRML.

Without a doubt, the variant without an independent test set is more likely to produce a higher F1 score, so it is not fair to compare the two directly. Moreover, it is frustrating to see the concepts of test and validation mixed up in some of the released code.

And, even worse, in JAA-Net's code, they deleted some bad cases from the test lists, which inflates the evaluation performance.

I noticed this too :D

I do think that this issue should be addressed and clarified in future publications to avoid confusion.

Good luck with your research.