Svito-zar / gesticulator

The official implementation for ICMI 2020 Best Paper Award "Gesticulator: A framework for semantically-aware speech-driven gesture generation"
https://svito-zar.github.io/gesticulator/
GNU General Public License v3.0

What is the reason to do Euler_Angle to Expressional_Map conversion? #18

Closed kelvinqin closed 3 years ago

kelvinqin commented 3 years ago

Dear Taras, can you please share your knowledge on why you do the Euler_Angle2Exponential_Map conversion first and then build the deep learning model? Is it because the exponential map has some special characteristics that make convergence easier?

A related question is why you don't consider doing an Euler_Angle2Position conversion for model building?

Thanks for your sharing, Kelvin

Svito-zar commented 3 years ago

Dear Kelvin,

We represent human motion using the exponential map because this representation avoids numerical issues caused by potential discontinuities in joint-angle values when they wrap from -pi to pi.
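A minimal sketch (using `scipy`, not the repo's own code) of the wrap-around problem described above: two physically near-identical poses can look almost 2*pi apart in Euler coordinates, which would give a regression model a huge, spurious error between adjacent frames.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Two near-identical orientations on either side of the +/-pi boundary
# of an X-rotation.
r1 = Rotation.from_euler('xyz', [np.pi - 0.01, 0.0, 0.0])
r2 = Rotation.from_euler('xyz', [0.01 - np.pi, 0.0, 0.0])

# In Euler coordinates the two frames look almost 2*pi apart ...
euler_jump = np.abs(r1.as_euler('xyz') - r2.as_euler('xyz')).max()

# ... yet the actual rotation separating them is tiny.
true_distance = (r1.inv() * r2).magnitude()

print(euler_jump)     # close to 2*pi (~6.26)
print(true_distance)  # ~0.02
```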

We pre-calculate all the features (not only motion features, but also speech and text features) before training the model in order to be efficient, since feature extraction is not so fast.

I hope that answers your question.

Best, Taras

kelvinqin commented 3 years ago

Dear Taras, thanks, I think I get the idea at some level. I may need to pick up more knowledge on what the exponential map is; can you please recommend an easy-to-understand tutorial on it?

If the major concern is the discontinuity of Euler angles, why not consider using 3D coordinates (positions) directly for model building, which would not introduce the discontinuity issue?

Have a nice day, Kelvin

kelvinqin commented 3 years ago

BTW, I am a speech person, which is why I lack knowledge on motion.

Later on, I will show you my recent result on face mesh generation, where I directly use 3D coordinates to build the model. Maybe you have another reason not to use 3D coordinates to build the gesture model?

Thanks, Kelvin

kelvinqin commented 3 years ago

https://user-images.githubusercontent.com/10486482/103012202-31dda980-4576-11eb-963d-5bf89224833e.mp4

Svito-zar commented 3 years ago

I don't know a particularly good tutorial on exponential map, but this could be a good starting point: http://www.cs.cmu.edu/~spiff/moedit99/expmap.pdf

As for using 3D coordinates: we don't do that because most virtual characters and humanoid robots cannot be driven by 3D coordinates; they require joint angles. Hence we use a representation which retains information about joint angles: it is easy to convert exponential maps back to joint angles, while it is tricky to convert 3D coordinates to joint angles.
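The "easy to convert back" point can be sketched in a few lines with `scipy` (not the repo's own conversion code; the `'ZXY'` channel order is an assumption here, since BVH files vary in their channel ordering):

```python
import numpy as np
from scipy.spatial.transform import Rotation

# A small batch of per-joint rotations given as exponential maps
# (axis-angle rotation vectors), one row per joint.
expmaps = np.array([
    [0.1,  0.0, 0.0],
    [0.0,  0.5, 0.2],
    [0.3, -0.4, 0.1],
])

# Exponential map -> joint (Euler) angles is a direct closed-form
# conversion, e.g. for a BVH file that stores ZXY channels.
eulers = Rotation.from_rotvec(expmaps).as_euler('ZXY', degrees=True)

# Round-trip back to exponential maps to check the mapping is invertible.
back = Rotation.from_euler('ZXY', eulers, degrees=True).as_rotvec()
assert np.allclose(back, expmaps, atol=1e-8)
```

Recovering joint angles from 3D joint positions, by contrast, requires solving inverse kinematics, which is ambiguous and much harder.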

Is it clear now?

Btw, @kelvinqin, nice results with face mesh generation! :)

kelvinqin commented 3 years ago

Dear Taras, Thanks all for your sharing, very helpful.

Merry Christmas, Kelvin

kelvinqin commented 3 years ago

Dear Taras, I have one more question, which is about converting the prediction result back to BVH. Thanks in advance.

In the data-processing phase, you convert the raw training data in bvh2feature.py (from Euler angles to exponential maps); the pipeline is:

```python
data_pipe = Pipeline([
    ('dwnsampl', DownSampler(tgt_fps=fps, keep_all=False)),
    ('root', RootTransformer('hip_centric')),
    ('mir', Mirror(axis='X', append=True)),
    ('jtsel', JointSelector(['Spine', 'Spine1', 'Spine2', 'Spine3', 'Neck', 'Neck1', 'Head',
                             'RightShoulder', 'RightArm', 'RightForeArm', 'RightHand',
                             'LeftShoulder', 'LeftArm', 'LeftForeArm', 'LeftHand'],
                            include_root=True)),
    ('exp', MocapParameterizer('expmap')),
    ('cnst', ConstantsRemover()),
    ('np', Numpyfier())
])
```

I guess the JointSelector call means that the motion feature you extract for training covers 15 segments, which corresponds to a 45-dim vector (15 * 3).

In the model-prediction phase, you call write_bvh.py to convert the prediction result back into Euler angles:

```python
def write_bvh(datapipe_file, anim_clip, filename, fps):
    data_pipeline = joblib.load(datapipe_file[0])
    inv_data = data_pipeline.inverse_transform(anim_clip)
    writer = BVHWriter()
    for i in range(0, anim_clip.shape[0]):
        with open(filename, "w") as f:
            writer.write(inv_data[i], f, framerate=fps)
```

When I look at the temp.bvh file, I find it is a full-body skeleton with everything (I used bvhacker to view it), instead of only 15 segments.

My question is: what is the secret to mapping the 45-dim vector back into a full-body skeleton?

One more question: the result I got is a little different compared with yours in https://vimeo.com/449190061. Not sure if it is because I am using a different model? I will attach my result for you to take a look (the arm movement is not as strong as yours).

Thanks so much for your guidance,

Kelvin

kelvinqin commented 3 years ago

Here is my result (run demo.py)

https://user-images.githubusercontent.com/10486482/103080996-fb556c80-4611-11eb-80ce-b9005d326103.mp4

Svito-zar commented 3 years ago

Your results look reasonable. They are slightly different because we now have a slightly different model for the demo.

Svito-zar commented 3 years ago

"My question is what is the secrete to map 45-dim vector back into a full body skeleton?" The pipeline extracts only 15 joints and remembers the positions of the other joints. When we call data_pipeline.inverse_transform(anim_clip) - it basically adds fixed values for the rest of the joints. So no magic here :)

If you have another question, please open another issue.

kelvinqin commented 3 years ago

Dear Taras, thanks a lot for your answer, very glad to understand your code better now :-) Kelvin