ZebangCheng / Emotion-LLaMA

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
BSD 3-Clause "New" or "Revised" License

Inference Model Peak Frame Computation? #6

Open eliird opened 1 month ago

eliird commented 1 month ago

I have been trying to figure out how to use this model for inference and evaluate it on other datasets without fine-tuning.

The scripts explain how to use the model with the extracted features, but I have a couple of questions about computing those features and would be glad if you could find the time to answer them.

ZebangCheng commented 1 month ago

In the initial phase of the project, we used OpenFace's Action Units system to select the Peak Activation frame from all frames of the video sample. This Peak Activation frame was then used to obtain detailed facial expressions. However, to enhance the processing speed of the Emotion-LLaMA demo on Hugging Face, we directly used the first frame of the video as the model input. If a simple and efficient method for quickly identifying the Peak Activation frame could be developed, it might improve the performance of Emotion-LLaMA.
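For reference, here is a minimal sketch of one way to pick a Peak Activation frame from OpenFace output. It assumes OpenFace's FeatureExtraction tool has already been run and produced a per-frame CSV with AU intensity columns (`AU01_r`, `AU02_r`, ...); the "highest summed AU intensity" rule is an illustrative heuristic, not necessarily the exact criterion we used.

```python
import pandas as pd

def select_peak_frame(openface_csv: str) -> int:
    """Return the frame number with the highest summed Action Unit intensity.

    Assumes `openface_csv` was written by OpenFace's FeatureExtraction tool,
    one row per frame. The sum-of-intensities score is a simple heuristic.
    """
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]  # OpenFace pads column names with spaces
    au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    peak_row = df.loc[df[au_cols].sum(axis=1).idxmax()]
    return int(peak_row["frame"])  # 1-based frame index in the OpenFace output
```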

The training data for Emotion-LLaMA primarily consists of videos with a single person. While there are videos with multiple people, the focus is on the emotions of one primary individual. Currently, Emotion-LLaMA is not well-equipped to handle scenarios where multiple individuals in a video exhibit different emotions. To recognize emotions for multiple people, one could try detecting each individual first and then assessing their emotions. This approach would require further exploration of datasets and multi-task training strategies.
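To make the "detect each individual first, then assess their emotions" idea concrete, here is a rough sketch using the MTCNN detector from facenet-pytorch as an example face detector (my choice for illustration; it is not part of this repository). Each crop would then be passed to the emotion model separately.

```python
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True)  # keep_all=True returns every detected face

def crop_faces(frame: Image.Image) -> list[Image.Image]:
    """Return one crop per detected person in the frame.

    Each crop could then be fed to the emotion model independently;
    how the crops are batched and prompted is left to the caller.
    """
    boxes, _ = detector.detect(frame)
    if boxes is None:
        return []
    return [frame.crop(tuple(int(v) for v in box)) for box in boxes]
```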

eliird commented 1 month ago

Thank you for the prompt reply. I would like to confirm a couple of things with you:

1 - Does the model deployed on Hugging Face use only the first frame of the video, along with the audio features, to make the inference? Does it not compute the Local Encoder and Temporal Encoder features? I am looking at the encode_image function in both this repository and the Hugging Face repository, and it seems to use only the first image and the audio when a video path is provided, and just the first frame when a list of PIL Images is provided.

Am I correct in understanding that, or am I missing something?

2 - I am trying to evaluate your model on the MELD dataset, so I was wondering whether I can use the deployed model directly for that. If the above explanation is correct, should I extract the features first, as described in the paper, and then evaluate on those extracted features?

ZebangCheng commented 1 month ago

Due to space and budget constraints on the Hugging Face platform, our demo does not utilize the Local Encoder and Temporal Encoder features, with their feature vectors set to zero. Despite this, the demo still demonstrates strong robustness. To clarify, the Emotion-LLaMA Global Encoder uses only the first frame of the video if a list of PIL Images is provided.
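In code, the demo setup is roughly equivalent to the sketch below. The feature dimensions and the function name are illustrative placeholders, not the actual values or API in the repository; please check the repo's config for the real dimensions.

```python
import torch
from PIL import Image

# Hypothetical feature dimensions -- see the repo's config for the real values.
LOCAL_DIM, TEMPORAL_DIM = 1024, 768

def build_demo_visual_inputs(frames: list[Image.Image], device: str = "cuda"):
    """Mimic the demo described above: the Global Encoder sees only the
    first frame, while the Local and Temporal feature vectors are zeroed out."""
    global_frame = frames[0]                                      # first frame only
    local_feat = torch.zeros(1, LOCAL_DIM, device=device)         # placeholder Local Encoder features
    temporal_feat = torch.zeros(1, TEMPORAL_DIM, device=device)   # placeholder Temporal Encoder features
    return global_frame, local_feat, temporal_feat
```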

If you are evaluating the model on the MELD dataset, you will need to extract the relevant features first. Extracting features beforehand helps avoid loading multiple encoders onto the GPU simultaneously, which can reduce the model's memory usage.
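As a sketch of that precompute-then-evaluate workflow, the loop below loads one encoder at a time, writes its features to disk, and frees the GPU before moving on. The `extract(path) -> Tensor` interface and the `encoders` mapping are assumptions for illustration, not the repository's actual API.

```python
import os
import torch

def precompute_features(video_paths, encoders, out_dir="features"):
    """Extract features one encoder at a time so only a single encoder
    occupies the GPU. `encoders` maps a name to a loader that returns a
    model exposing an assumed `extract(path) -> Tensor` method."""
    os.makedirs(out_dir, exist_ok=True)
    for name, load_encoder in encoders.items():
        model = load_encoder().cuda().eval()
        with torch.no_grad():
            for path in video_paths:
                feat = model.extract(path)
                stem = os.path.splitext(os.path.basename(path))[0]
                torch.save(feat.cpu(), os.path.join(out_dir, f"{name}_{stem}.pt"))
        del model
        torch.cuda.empty_cache()  # free GPU memory before loading the next encoder
```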

eliird commented 1 month ago

Thank you for the clarification.

Best