UttaranB127 / Text2Gestures

This is the official implementation of the paper "Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents".
https://gamma.umd.edu/t2g/
MIT License

Reproducing Pre-Trained Model Results #4

Closed TobiasHallen closed 2 years ago

TobiasHallen commented 3 years ago

I'm running into some issues reproducing the results of this implementation. Using the pre-trained model, everything works fine and the resulting gestures are convincing. However, trying to reproduce these results by training a new model on the same dataset has not been successful. I have attempted both the parameters described in the paper (lr=0.001, epochs=600) and those given here in the repo as they are, to no avail. The resulting gestures are incoherent, and the model does not converge linearly as I would have expected (for a 600-epoch session, the best mean loss was 0.27 at epoch 166/600). I was wondering what parameters the pre-trained model was trained under, and whether they match those set in the given main.py file. Also, is this convergence behaviour what you would expect? Did you experience the same while training this model? I have attached one of the output videos from epoch 510 for reference, as well as the log file of that 600-epoch training session.

https://user-images.githubusercontent.com/32342223/117440869-8a618200-af2c-11eb-90e8-574552d43f75.mp4

log.txt

UttaranB127 commented 3 years ago

Based on the sample video, it seems the problem could be one of two things:

  1. The network is not sufficiently trained
  2. The predicted rotation vectors are not normalized

I think it's more likely that the problem is the former, since you are using the network out-of-the-box and the pre-trained model works fine. The loss function convergence is certainly not a linear decrease. The loss value keeps oscillating within a certain range after a few epochs, but the rotations themselves become smoother and more realistic with more epochs. Even if the loss value itself is not the lowest overall, you can try evaluating the network at some epoch greater than 300, where the loss value is within 10% of the lowest loss value overall.
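As an illustration of that selection rule, here is a minimal sketch that picks the first epoch after 300 whose mean loss is within 10% of the overall minimum. It assumes each log line contains an epoch number and a mean loss value; the regex is a placeholder and may need adjusting to the actual log format:

```python
import re

def pick_eval_epoch(log_path, min_epoch=300, tolerance=0.10):
    """Pick the first epoch after min_epoch whose mean loss is within
    `tolerance` of the lowest mean loss seen anywhere in the log."""
    losses = {}  # epoch -> mean loss
    with open(log_path) as f:
        for line in f:
            # Hypothetical log format; adjust the pattern to the actual log.
            match = re.search(r'epoch[^\d]*(\d+).*?mean loss[^\d]*([\d.]+)',
                              line, re.IGNORECASE)
            if match:
                losses[int(match.group(1))] = float(match.group(2))
    if not losses:
        return None
    best = min(losses.values())
    for epoch in sorted(losses):
        if epoch > min_epoch and losses[epoch] <= best * (1.0 + tolerance):
            return epoch
    return None

print(pick_eval_epoch('log.txt'))
```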

Let me know if this helps.

TobiasHallen commented 3 years ago

Hi Uttaran,

I trained the model again with the parameters as given here in the repo, this time up to 1700 epochs. Still, the resulting animations are similar to the example above. I have attached the training log and resulting model below, and I was hoping you might be able to evaluate it, to see whether the issue lies in the training or in the evaluation on my end. Is it possible that the issue is still insufficient training? I would imagine that 1700 epochs should produce somewhat usable results.

model file: https://drive.google.com/file/d/18efpMk5_jX3_RgLp-etv7-YanHdUHolO/view?usp=sharing
log: https://github.com/UttaranB127/Text2Gestures/files/6460760/log.txt

UttaranB127 commented 3 years ago

Hi Tobias,

Sure, I'll take a look. It could be a bug in the network code that I uploaded.

Best, Uttaran

TobiasHallen commented 3 years ago

Hey Uttaran,

I was just wondering whether you'd had an opportunity to look at this issue. I cloned the repo and downloaded the data fresh and trained a new model to ~1200 epochs just to be sure, and it returned the same results. Might it be an issue with the parameters as they're set in the repo? The defaults are different from those given in the 'help' strings of the argument parser.

UttaranB127 commented 3 years ago

Yes, the 'help' strings are not fully updated. I'll update that as a first step. I'll also be testing the network over this weekend to check for the issues you're encountering.

UttaranB127 commented 3 years ago

I've made commit 071e989b210bf114cc8d4370c7f01331ff35606d fixing an issue with displaying animations as videos. It seems the video you were creating did not apply the rotations in the proper joint order. The training is working fine. I've also updated all the 'help' strings in main.py. Also, I would recommend visualizing the animations using the generated .bvh files in Blender. Personally, I find it more convenient to see the animations in an interactive 3D space.
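For reference, a .bvh file can be imported into Blender via File > Import > Motion Capture (.bvh), or with a couple of lines in Blender's Python console (a minimal sketch; the file path is only an example):

```python
import bpy

# Import a generated BVH file as an animated armature (path is an example).
bpy.ops.import_anim.bvh(filepath="/path/to/root.bvh")

# Play back the imported animation in the viewport.
bpy.ops.screen.animation_play()
```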

Let me know if you're still facing issues.

TobiasHallen commented 3 years ago

Hey Uttaran,

Thanks for the fix, though I'm afraid it doesn't seem to have resolved the issue. I should have been clearer in my earlier descriptions: I've been evaluating the results in Blender all along, and they matched the video output. Having pulled the new commit, the generated (.bvh) gestures are still similarly broken. Have you tested the fix by training a new model?

UttaranB127 commented 3 years ago

Yes, I've retrained the model for about 500 epochs. Although the gestures are not great after 500 epochs, they are at least realistic, i.e., the agent is in a sitting position and moving its arms. Could you share the bvh files you're getting?

TobiasHallen commented 3 years ago

Thanks for getting back so quickly. I'll quickly retrain a new model to make sure I'm not missing anything before wasting your time any further. It should be more or less ready by this evening. Here's an example of a bvh file I generated from an old model, ~1200 epochs: link

UttaranB127 commented 3 years ago

Don't worry about it, let me know how it works out. I'm also attaching the log file and a sample bvh file that I obtained after retraining.

log.txt root.bvh

UttaranB127 commented 3 years ago

Actually, I'm running some more tests. The results can look arbitrary until about 1K epochs, and for losses above 0.2. I'm getting an estimate of when you can start expecting realistic results. The bvh I shared earlier does not seem to be a very reliable reflection of the training.

UttaranB127 commented 3 years ago

Hi Tobias,

Are you using the generate_while_train method for generating the results? Note that it is slightly different from the forward pass. One limitation of our transformer-based approach was that it could not handle long-term predictions without a substantial history. The training process takes in a partial history and generates for the short term, whereas in the final testing, we use a longer history to make full-sentence predictions. Using the training routine for full-sentence predictions will not work correctly, unfortunately. We are working on overcoming this limitation as part of our follow-up work.
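To illustrate the distinction schematically (a hypothetical sketch with a stand-in model, not the actual interface in this repo): during training the network only predicts a short horizon beyond a partial ground-truth history, whereas at test time predictions are rolled out autoregressively from a longer seed history:

```python
import torch
import torch.nn as nn

def rollout(model, seed_history, num_frames):
    """Autoregressive rollout: start from a seed history of ground-truth
    frames and repeatedly append the model's next-frame prediction."""
    frames = seed_history.clone()               # (1, T_seed, pose_dim)
    for _ in range(num_frames):
        next_frame = model(frames)[:, -1:, :]   # hypothetical: predict one frame ahead
        frames = torch.cat([frames, next_frame], dim=1)
    return frames[:, seed_history.shape[1]:]    # only the generated part

model = nn.Linear(63, 63)                       # stand-in for the trained network
history = torch.randn(1, 64, 63)                # 64 ground-truth pose frames

short_preds = rollout(model, history[:, :16], num_frames=8)   # training-style: short seed, short horizon
full_preds = rollout(model, history, num_frames=200)          # test-style: long seed, full-sentence horizon
print(short_preds.shape, full_preds.shape)      # (1, 8, 63), (1, 200, 63)
```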

TobiasHallen commented 3 years ago

Hi Uttaran,

Thanks for the information. I have not been using the generate_while_train method for any of my recent models. The new model is still training, but should hopefully be finished by this evening or tomorrow morning.

UttaranB127 commented 3 years ago

If the problem persists, you can try using the GloVe version (main_glove.py), or use our provided pre-trained model as the initialization of the network for other datasets you're planning to use. You'll need to perform parameter tuning anyway for the new datasets.

TobiasHallen commented 3 years ago

Hi Uttaran, I had a couple more questions:

With the latest update to the best-model loading, it's no longer possible to explicitly define a starting epoch unless it is also coincidentally the model with the lowest loss. If a training session is interrupted, is it viable to simply resume training by specifying the epoch of the most recently saved model as the start epoch?

Also, would it be possible for you to attach the log file of your recent successful model training? I'd like to compare to my own to see if I can spot any obvious dissimilarities.

UttaranB127 commented 3 years ago

Thanks for pointing it out; I've added the capability to load at a specified epoch in the latest commit 04dde7a9187475741c68e329672f73591e9b11cf. It is certainly reasonable to resume training from the most recently saved model. Currently, the code automatically saves the model every 10 epochs.
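For reference, resuming from a saved checkpoint typically looks like the following minimal PyTorch sketch (the file-name pattern and the checkpoint keys are assumptions here, not necessarily what this code saves):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                                   # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

start_epoch = 1490                                          # most recently saved epoch
# Assumed file name and checkpoint layout; adjust to what the code actually saves.
checkpoint = torch.load(f'models/epoch_{start_epoch}.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

for epoch in range(start_epoch + 1, 1501):
    pass  # run one training epoch here, saving a checkpoint every 10 epochs
```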

I'm also sharing the log file from our submission, which is the best version of the training we had and also the one for which we provide our pretrained model. I'll also try to share a log file from our most recent experiments. It might take a while, since we're trying to figure out the parameters to improve the functionality of our network.

log_for_pretrained_model

TobiasHallen commented 3 years ago

I've just finished training another model to 1500 epochs, again with the same gibberish results. In the interest of clarity, I trained the model with no changes made other than changing the 'Train' parameter to true, and I evaluated the model similarly with no changes other than setting 'Train' back to false and setting the epoch in the generate_motion() call to 1500. Is this the same process you would have used to train your latest working model?

Also, comparing my own training log to the one corresponding to the pretrained model, I notice that the loss values are quite different. My values tend to oscillate between ~0.7 and ~1.2, whereas in the above log file they seem to be closer to ~0.25. Could you think of a reason why that might be?

Finally, is the data archive you are using the same one that is linked here in the repo? I am running out of ideas as to why I cannot seem to reproduce any results, barring possible differences in computer hardware. Could you see that having any impact?

UttaranB127 commented 3 years ago

Hi Tobias,

We have been conducting further experiments with our model since setting up this repo. It could be that some of the hyperparameters were altered somewhere, leading to an unstable model. We're working on the fixes for it. As I mentioned in my previous comments, I would recommend the following for now:

  1. Try training the model on your own dataset. You'll need to play around with the hyperparameters anyway for a new dataset. You can use our pretrained model as an initialization of the network for your own dataset.
  2. Try the GloVe version, main_glove.py. It should have an older, stable version of the model.

TobiasHallen commented 3 years ago

When using main_glove.py, there seem to be some missing files (such as a glove.6B.300d.txt) used to build the embedding table. Are these files available anywhere?

UttaranB127 commented 3 years ago

Yes, the GloVe models are available for download at http://nlp.stanford.edu/data/. We use the GloVe 6B pretrained model: http://nlp.stanford.edu/data/glove.6B.zip
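In case it is useful: each line of glove.6B.300d.txt is a word followed by 300 space-separated floats, so an embedding table can be built along these lines (a minimal sketch, independent of how main_glove.py actually loads it):

```python
import numpy as np

def load_glove(path='glove.6B.300d.txt', dim=300):
    """Build a word -> vector lookup from a GloVe text file."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vector = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if vector.shape[0] == dim:   # skip any malformed lines
                embeddings[word] = vector
    return embeddings

glove = load_glove()
print(glove['hello'].shape)  # (300,)
```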

TobiasHallen commented 3 years ago

I'm currently exploring the possibility that there is a mistake in how PyTorch and CUDA are set up on my machine. Would you happen to know roughly what percentage of the GPU would be engaged during training? Mine seems to barely be utilized at all.

UttaranB127 commented 3 years ago

The percentage of GPU used depends entirely on the GPU you are using. For example, I'm using an NVIDIA GeForce GTX 1080 Ti, and the typical usage is around 40% when I run main_glove.py. You can check the device property of your torch variables to see whether they're stored on the GPU or on the CPU. This might help: https://stackoverflow.com/questions/65381244/how-to-check-if-a-tensor-is-on-cuda-in-pytorch. Also check whether the network model is initialized on the GPU or on the CPU.
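A quick check along those lines (a small PyTorch sketch; the linear layer is just a stand-in for the actual network):

```python
import torch
import torch.nn as nn

print(torch.cuda.is_available())            # is a CUDA device visible at all?

model = nn.Linear(8, 8)                     # stand-in for the actual network
x = torch.randn(4, 8)

print(next(model.parameters()).device)      # where the model weights live
print(x.device, x.is_cuda)                  # where the tensor lives

# Move both to the GPU if one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)
```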

UttaranB127 commented 3 years ago

Hi Tobias,

Just an update that I've reverted main.py to an older, more stable version that should work correctly. Let me know if you end up using it.

TobiasHallen commented 3 years ago

Hey Uttaran,

I actually meant to get back to you yesterday. I managed to train a working model with main_glove.py, and have started adapting it to my new dataset. I'll definitely check out the changes, though, and see if I can switch back to main.py and get that to work. Thanks for all your help.

TobiasHallen commented 3 years ago

Hello again,

I just wanted to thank you again for all your help; I was able to generate results with my own data comparable to those generated with the MPI archive. However, I recently came across your video demonstration of the project:

https://www.youtube.com/watch?v=Hvovao_jUzk

I was wondering under what parameters it was possible to generate such well-defined gestures. In my own experience with this approach, the trained models have been relatively lackluster in terms of active gesticulation, mostly settling into an average pose instead, with little difference between individual results. Would you have any advice on a good place to start with hyper-parameter tuning to improve this? Did you encounter similar issues yourself? Thanks in advance!

UttaranB127 commented 3 years ago

Hi Tobias, glad to know our method is working for you. What you mentioned is actually one of the limitations of our approach: to generate motions of the quality shown in the paper, you would need a significantly long history (the longer the better, since the current transformer architecture is not that good at generating diverse motions). An alternative would be to explore the best possible transformer architecture that can learn diverse motions from shorter histories. Transformers seem to be notoriously hard to train to learn long-term predictions from short histories, especially with limited computational resources. We have also explored a different architecture using RNNs that I can share with you once we are able to publish it. Meanwhile, you can take a look at the work on gestures from trimodal context to get some idea of how to generate diverse motions from short histories (it uses RNNs, though, not transformers). Hope that helps!

Uttaran

TobiasHallen commented 3 years ago

Hi, when you say gesture history, is that referring to the sample count of the dataset on which the model is trained, or another metric?

UttaranB127 commented 3 years ago

It refers to the number of gesture frames that need to go into our network for each sample in the dataset.

TobiasHallen commented 3 years ago

Is the gesture data used for the examples in the video then still the MPI data as given in this repo? Was the sample data interpolated to further increase the frame counts, or were longer samples from an entirely different dataset used?

UttaranB127 commented 3 years ago

The examples shown are from the MPI data as given in the repo. We did not perform any data interpolation in this project.

TobiasHallen commented 3 years ago

Is the training or synthesis method then substantially different from what is given here? As above, the samples I am able to generate have never displayed such varied movement, instead settling on an average pose.

UttaranB127 commented 3 years ago

There was an issue with the main.py file that was uploaded earlier, which led to noisy results/averaged poses. Are you saying the current version is also not synthesizing varied movements? Or are you able to replicate the results in the paper but not make it work for other data? In the latter case, I should reiterate that we have had generalization issues with the transformer network, which we haven't managed to solve successfully (maybe it's a case of not having enough training data).

TobiasHallen commented 3 years ago

I believe I was having issues replicating the diversified movement even with the MPI dataset, but I will retrain to confirm. Would you recommend setting the frame drop to one to maximize the amount of information per input sample, or would that make no difference?

UttaranB127 commented 3 years ago

That would significantly increase the time per iteration, and would make the motions appear smoother or "richer". But I'm not sure if that'll improve the diversity of the motions. We experimented with frame drops of 2 and 4 and the diversity of the motions wasn't too different. Could you share any of your current generated samples and your log file?
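For context, the frame drop is just a temporal subsampling factor; conceptually it amounts to something like this (a schematic sketch, not the repo's exact preprocessing):

```python
import numpy as np

frames = np.random.randn(240, 63)   # e.g. 240 frames of 63-D pose features

frame_drop = 2                      # keep every 2nd frame
subsampled = frames[::frame_drop]   # 120 frames; frame_drop=1 keeps all 240

print(frames.shape, subsampled.shape)
```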

TobiasHallen commented 3 years ago

I'm currently retraining with the MPI dataset; I've since lost my model trained on the original data. I'll update when it reaches ~500 epochs.

UttaranB127 commented 3 years ago

Sure, take your time. I won't be available to revisit this project before late August myself.

TobiasHallen commented 3 years ago

Hello again, I just evaluated at epoch 522, and the resulting gestures are as described above: fairly motionless. In fact, all 144 generated samples are near-identical in all but frame count. Link with 2 generated samples.

UttaranB127 commented 2 years ago

Hi Tobias, apologies for the delay in getting back on this. Training the model is actually sensitive to the initialization, especially because the transformer network requires more data than we have in the MPI dataset. This has been a driving issue for us to move towards a more stable implementation, which I highly recommend: https://github.com/UttaranB127/speech2affective_gestures. If you plan on using this version of the model, I would recommend using our pre-trained model as the initialization for your purposes.