evonneng / learning2listen

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Abrupt Jumps in listener expression #8

Open rohitkatlaa opened 1 year ago

rohitkatlaa commented 1 year ago

I tried generating output from the data provided with the paper and found that there are consistent jumps between frames. Most of the time the jumps occur every 32 frames, but they also occur at intervals of 16 frames. Is this the usual behaviour of the model? I am also attaching a video that I generated using the model.

https://user-images.githubusercontent.com/42460632/197406589-6f71bad6-ffce-4f85-afc6-ebc0814db622.mp4

evonneng commented 1 year ago

Hmmm, it seems a lot of people are having this issue. The video results look extremely noisy compared to what I'm used to. I'm working with others to debug this issue on the released version. I will post typical results and a few debugging ideas soon. Sorry for the inconvenience.

rohitkatlaa commented 1 year ago

Thank you @evonneng. These are the steps that I took: when I initially tried to use the code to generate the output, the outputs always had 32 empty frames at the start, so I skipped them by starting the loop from 32 instead of 0 in that line. After this I gathered the data for a particular video and tested it, and I still faced jumps every 32 frames. I am not exactly sure how to generate the output for a video of arbitrary length, so I hope you can help me prepare the input in the correct format.
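For clarity, a minimal sketch of the workaround I describe above (skipping the leading zeroed frames before visualization). The variable and constant names here are just illustrative, not taken from the repo:

```python
import numpy as np

SEQ_OFFSET = 32  # number of leading zeroed frames produced at the start of generation

def trim_leading_zeros(pred_exp: np.ndarray, pred_pose: np.ndarray):
    """Drop the first SEQ_OFFSET frames of the predicted expression/pose before rendering."""
    return pred_exp[SEQ_OFFSET:], pred_pose[SEQ_OFFSET:]
```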

evonneng commented 1 year ago

Hmmm, what you are saying sounds reasonable. Usually I zero out the first 32 frames, but that should not affect the results since masking is used anyway, so the first 32 frames are discarded.

Could you please describe how you decode the generated sequence?

For reference, here's an example of how a long-sequence result should look. https://user-images.githubusercontent.com/14854811/199045339-a8be5fe3-4959-4b3f-834a-6d40bd519ed1.mp4

rohitkatlaa commented 1 year ago

If by decode you mean generating the final output, here are my steps: I take the generated output and pass the exp and pose parameters through the FLAME model, using the code as a reference. I am also using the default configuration from the Flame_Pytorch repo. This gives me the 3D output.
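Roughly, what I do looks like the following sketch. It assumes the FLAME_PyTorch interface (`FLAME(config)` returning vertices and landmarks); the zero shape code, the parameter dimensions, and the pose layout are my assumptions, not something taken from the released test scripts:

```python
import torch
from flame_pytorch import FLAME, get_config  # assumed import path from the FLAME_PyTorch repo

config = get_config()          # default configuration, as mentioned above
flame = FLAME(config).cuda()   # note: FLAME_PyTorch may expect the batch size to match config.batch_size

def frames_to_vertices(exp: torch.Tensor, pose: torch.Tensor):
    """exp: (B, 50) expression codes, pose: (B, 3) rotations predicted by the listener model."""
    B = exp.shape[0]
    shape = torch.zeros(B, 100, device=exp.device)  # neutral identity (assumption)
    # assumed layout: predicted rotation in the first 3 pose dims, jaw left at zero
    full_pose = torch.cat([pose, torch.zeros(B, 3, device=exp.device)], dim=1)
    verts, _landmarks = flame(shape, exp, full_pose)
    return verts
```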

If you are talking about the decoding process in the VQ-VAE, I am following the same code provided in test_vq_decoder.py, i.e. generating 32-frame predictions and appending the first 8 frames from each to the final sequence (code), then decoding this sequence in steps of 32 (code).
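In pseudocode, this is my understanding of that sliding-window procedure (function names like `predict_window` and `vq_decode` are placeholders standing in for the corresponding calls in test_vq_decoder.py):

```python
import torch

WINDOW, STEP = 32, 8

def generate_long_sequence(predict_window, vq_decode, speaker_feats, total_frames):
    # Autoregressively predict 32-frame windows, keeping only the first 8 frames of each.
    kept = []
    for t in range(0, total_frames, STEP):
        window_pred = predict_window(speaker_feats, t, WINDOW)  # (WINDOW, D)
        kept.append(window_pred[:STEP])
    seq = torch.cat(kept, dim=0)[:total_frames]

    # Decode the stitched sequence through the VQ decoder in chunks of 32 frames.
    decoded = [vq_decode(seq[t:t + WINDOW]) for t in range(0, seq.shape[0], WINDOW)]
    return torch.cat(decoded, dim=0)
```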

@evonneng Could you please mention the video name (from the dataset used in the paper) and the frame numbers that were used to generate the output you mentioned in your previous post? I think it could be helpful if we can look at the output generated from the same input.

rohitkatlaa commented 1 year ago

Ping.

Daksitha commented 1 year ago

@evonneng and @rohitkatlaa, this is the line that causes the first 32 frames to always be zero (code).

Commenting it out would remove this effect; however, I am not sure how that would affect the overall autoregressive predictions. Maybe @evonneng can comment on this.

evonneng commented 1 year ago

Thanks for the comments.

Re the first 32 zeros: I set the first 32 frames to 0 to make sure we don't see any ground-truth motion when outputting the predictions. Removing that code could lead to some information leaking, so the results might improve slightly, since the model would essentially be seeing past ground-truth listener motion. During training, since I apply random masking, the model should learn to output motion properly even if it sees leading zeros.
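As a minimal sketch of what that zeroing amounts to at test time (tensor names and shapes here are assumptions, not the exact code in the repo):

```python
import torch

MASK_LEN = 32  # leading listener frames hidden from the predictor

def mask_listener_history(listener_motion: torch.Tensor) -> torch.Tensor:
    """listener_motion: (B, T, D) ground-truth listener motion; zero the first MASK_LEN frames."""
    masked = listener_motion.clone()
    masked[:, :MASK_LEN, :] = 0.0
    return masked
```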

Re jumpy results: I have worked with others on the motion generation, and we were able to reproduce similar results using this codebase. A few things to note that might help with generation.

Hope this helps!

wangzheng1209 commented 1 year ago

@evonneng Hi, evonneng. Could you explain more about the "random walk"? Does it mean randomly sampling from the codebook to generate a sequence, with the generated result still being good because of the non-deterministic mapping? Hope for your reply!
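To make the question concrete, this is what I mean by random sampling from the codebook. This is only an illustration of my reading of the question, not a confirmed description of the paper's "random walk" baseline, and `vq_decode` is a placeholder for the actual decoder call:

```python
import torch

def random_codebook_sequence(codebook: torch.Tensor, num_tokens: int, vq_decode):
    """codebook: (K, D) learned VQ embeddings; vq_decode maps embeddings to motion frames."""
    idx = torch.randint(0, codebook.shape[0], (num_tokens,))  # uniformly random codebook indices
    return vq_decode(codebook[idx])                           # decoded motion for the random tokens
```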