GuyTevet / motion-diffusion-model

The official PyTorch implementation of the paper "Human Motion Diffusion Model"
MIT License
3.14k stars 340 forks

Is it possible to train the model with a video dataset of multiple people moving? #44

Closed venturaEffect closed 1 year ago

venturaEffect commented 1 year ago

Hi!

First of all, congrats on your project. I was looking for a way to contact you, and this was the only option I found.

I've seen that motion is generated based on a prompt.

My question is simple: is it possible to train the model with a video dataset of people moving?

From what I've seen so far in your video, there is only one person. Could it work with multiple people?

If it isn't possible, do you know of any project that works in this direction?

Thanks for taking the time.

GuyTevet commented 1 year ago

Thanks @venturaEffect! I invite you to try :)

Here is one reference I know for multi-character motion generation: Multi-Person 3D Motion Prediction with Multi-Range Transformers

venturaEffect commented 1 year ago

Thanks @GuyTevet for your answer. What I've seen is that it "predicts" the future movement of a multi-person scene. What I'm looking for is to train on a video dataset of people moving and then generate the movements from a prompt.

Another question: is it possible to transfer the generated movements to "dress" this pseudo-human with AI-generated people? Something like Stable Diffusion creating pictures of people, but in motion. So it would take the motion and apply it to the generated people.

Hope it wasn't too confusing :)

GuyTevet commented 1 year ago

By video, do you mean pixel data or human joint angles (as we do)?

venturaEffect commented 1 year ago

Joint angles.

I've seen another project, Move AI. The main difference I see from GRAML is that they generate from videos recorded with at least two cameras.

My interest is to train on a video dataset and classify it into categories, then generate the movements from a prompt as you do.

Hope it is clear. I'm really interested in achieving that. Sorry if I'm interrupting you.

GuyTevet commented 1 year ago

Not at all. So you want to extract (multi-person) joint angles from a single-view camera (e.g. a YouTube video) and train on top of this data?
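For concreteness, here is a minimal sketch of what single-view extraction could look like with an off-the-shelf estimator. This is not part of MDM; MediaPipe Pose is just one example, it tracks a single person per frame (multi-person would need a person detector in front), and the video path is a placeholder:

```python
import cv2
import mediapipe as mp

# Single-person pose extraction from a monocular video (e.g. a YouTube clip).
# MediaPipe Pose returns 33 landmarks per frame; for multi-person you would
# run a person detector first and estimate a pose per crop.
pose = mp.solutions.pose.Pose(static_image_mode=False)
cap = cv2.VideoCapture("clip.mp4")  # placeholder path

frames = []  # per-frame lists of (x, y, z, visibility) tuples
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks:
        frames.append([(lm.x, lm.y, lm.z, lm.visibility)
                       for lm in result.pose_landmarks.landmark])

cap.release()
pose.close()
print(f"extracted {len(frames)} frames of 33 landmarks each")
```

Note that landmarks like these are positions, not joint angles; converting them to angles (and retargeting to a skeleton) is an extra step.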

venturaEffect commented 1 year ago

Yes!

For single-camera video, I found a great project called Radical.

  1. The whole idea is to get, from a dataset of multi-person videos, the gender, age, and race (if possible).
  2. Train the model on the given dataset to respond based on a description.
  3. Then add to the "humanoid" shapes the characteristics of provided body and face images.

I really appreciate your suggestions, because you know for sure whether it is possible to achieve that using your model, or who else I can contact for information on how to achieve it.

Again, sorry for taking up your time.

GuyTevet commented 1 year ago

The motion capture quality looks good judging by their demo. I think multi-person motion could be learned by MDM; some adaptations will be needed for sure. For that, you can take inspiration from prior work in the field, such as the one I sent.
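For illustration only (this is not something MDM implements), the most naive data-level adaptation would be to fix the number of persons and stack their joints into one wider "skeleton", so a sample keeps MDM's [bs, njoints, nfeats, nframes] layout; all sizes below are placeholders:

```python
import torch

# Toy multi-person packing: stack each person's joints along the joint axis
# so one sample still looks like a single [njoints, nfeats, nframes] motion.
# Sizes are made-up placeholders, not MDM's real dimensions.
n_persons, n_joints, n_feats, n_frames = 2, 22, 6, 120

persons = [torch.randn(n_joints, n_feats, n_frames) for _ in range(n_persons)]

# [n_persons * n_joints, n_feats, n_frames] -> one "wider" skeleton
multi_person = torch.cat(persons, dim=0)
batch = multi_person.unsqueeze(0)  # [1, 44, 6, 120]
print(batch.shape)
```

A real adaptation would also need to handle person ordering, inter-person interaction, and varying person counts, which is where prior work like the paper above comes in.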

venturaEffect commented 1 year ago

Sorry to interrupt you again 😅

But I can't find the link you're referring to.

GuyTevet commented 1 year ago

Multi-Person 3D Motion Prediction with Multi-Range Transformers

venturaEffect commented 1 year ago

Thanks for the answer @GuyTevet, but I'm not looking for motion prediction. I'm looking to record the movements of humans in a video, and then use this dataset to train a model like yours that, given a prompt, generates the movements. Something like what you have done with motion-diffusion-model.

GuyTevet commented 1 year ago

Yup, this is possible. Just consider that pose estimation algorithms often provide noisy predictions.
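For example, a common mitigation is to temporally smooth the estimated joint trajectories before training. A minimal sketch with a Savitzky-Golay filter (the [nframes, njoints, 3] shape is an assumed convention for this sketch, not MDM's format):

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_joints(joints: np.ndarray, window: int = 11, order: int = 3) -> np.ndarray:
    """Smooth noisy per-frame pose estimates along the time axis.

    joints: [nframes, njoints, 3] array of estimated joint positions
    (assumed layout for this sketch).
    """
    return savgol_filter(joints, window_length=window, polyorder=order, axis=0)

# Usage: jittery per-frame predictions in, smoother trajectories out.
noisy = np.random.randn(120, 22, 3).cumsum(axis=0)  # fake noisy trajectories
smoothed = smooth_joints(noisy)
print(noisy.shape, smoothed.shape)  # (120, 22, 3) (120, 22, 3)
```

Heavier cleanup (foot-skate removal, filling occlusions, confidence-weighted filtering) may be needed depending on the estimator.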