MooreThreads / Moore-AnimateAnyone

Character Animation (AnimateAnyone, Face Reenactment)
Apache License 2.0

Weird artifacts when training on open-source TikTok dataset? #85

Open BIGJUN777 opened 7 months ago

BIGJUN777 commented 7 months ago

Hi there. When I trained on the TikTok dataset, which has 340 videos, it was weird that it would generate some cotton-shaped artifacts around the person. Have you encountered this issue in your training? [screenshot: Snipaste_2024-01-30_16-08-19] One possible reason, I think, is that it is hard for the model to learn the movement of long hair from limited data. Do you have any ideas? Thanks.

jsonkcli commented 7 months ago

Hey BigJun, do you have twitter? I'd like to chat about training

BIGJUN777 commented 6 months ago

@jsonkcli Zhijun Liang. I rarely use Twitter, but I will pay attention to your message.

jsonkcli commented 6 months ago

I couldn't find your twitter, do you have WeChat or Discord?

BIGJUN777 commented 6 months ago

Please check my twitter again, or give me your twitter and I will drop you a message. [screenshot]

jsonkcli commented 6 months ago

@ov47i

xiaohutongxue-sunny commented 6 months ago

Hello, could you tell me which pose detection you used? The provided pose detection algorithm predicts a shaky pose sequence for me, which causes worse output.

BIGJUN777 commented 6 months ago

DWPose. I think it is normal to get a jittery pose sequence because we detect the pose frame by frame, which lacks temporal consistency.
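
Since the jitter comes from running the detector independently on each frame, a simple temporal filter over the keypoint sequence usually helps. Below is a minimal sketch (my own illustration, not code from this repo) that applies an exponential moving average to a DWPose-style keypoint array; the array shape and the `alpha` value are assumptions.

```python
import numpy as np

def smooth_keypoints(keypoints: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Exponential moving average over a per-frame keypoint sequence.

    keypoints: assumed shape (num_frames, num_points, 2) in image coordinates.
    alpha: blending factor; smaller values give stronger smoothing.
    """
    smoothed = keypoints.astype(np.float32)
    for t in range(1, len(smoothed)):
        smoothed[t] = alpha * smoothed[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

# Usage: pose_seq = smooth_keypoints(pose_seq, alpha=0.4)
```

In practice you would also want to skip keypoints with low detection confidence before smoothing, otherwise missed detections get smeared across neighbouring frames.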

ManiaaJia commented 6 months ago

Hey BigJun,

I was wondering whether you've been training your models using NVIDIA A100 GPUs, and could you kindly provide some details about the training duration in relation to the specific batch size and resolution settings employed? Specifically, I would appreciate it if you could share how long each phase—like the first stage and the second stage of training—took with their corresponding batch sizes and resolutions.

This information would be immensely helpful for me to make a preliminary estimation of the associated training costs. Thank you so much~!

BIGJUN777 commented 6 months ago

Referring to Open-AnimateAnyone, I rewrote the code according to my understanding. I conducted my experiments on NVIDIA A100 80G GPUs. On the TikTok dataset, at 512x512 resolution, I used a batch size of 64 in stage 1 and 4 in stage 2. It took almost a day to train 50k steps in stage 1 and about 8 hours to train 11k steps in stage 2. I did not do much training-speed optimization. By the way, this task is data-hungry, and I could not get results as good as the author's on such a small dataset. Maybe I missed something. Have a try yourself; I'm looking forward to your good news.
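
If you just want a rough idea of the compute cost, these numbers are enough for a back-of-envelope estimate. The sketch below is purely illustrative (not project code); the seconds-per-step values are derived from the durations I quoted above, and the hourly GPU price is a placeholder you should replace with your own rate.

```python
# Rough training-cost estimate: wall-clock hours = steps * seconds-per-step / 3600.
def estimate_hours(steps: int, seconds_per_step: float) -> float:
    return steps * seconds_per_step / 3600.0

# Derived from the numbers above: ~24 h for 50k steps (stage 1), ~8 h for 11k steps (stage 2).
stage1_sec_per_step = 24 * 3600 / 50_000   # ~1.7 s per step at bs=64, 512x512 images
stage2_sec_per_step = 8 * 3600 / 11_000    # ~2.6 s per step at bs=4, video clips

gpu_hourly_price = 2.0  # placeholder USD per A100-hour; substitute your own rate
num_gpus = 8            # 8 x A100 80G in my runs

for name, steps, sps in [("stage 1", 50_000, stage1_sec_per_step),
                         ("stage 2", 11_000, stage2_sec_per_step)]:
    hours = estimate_hours(steps, sps)
    cost = hours * num_gpus * gpu_hourly_price
    print(f"{name}: ~{hours:.1f} h wall-clock, ~${cost:.0f} at the assumed rate")
```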

ManiaaJia commented 6 months ago

Your answer is very helpful to me. Thank you again!

YifuDeng commented 5 months ago

Hey BIGJUN,

Thank you for sharing! It's really helpful. I have a few more questions that I hope you can answer.

  1. How many GPUs did you use? My training time is longer than yours.
  2. How did you convert the data to a resolution of 512x512? Did you directly resize it, or did you crop each video?
  3. Can DWPose extract keypoints only for the body and hands? The pose sequence you provided is consistent with the original project, but Moore's pose sequence includes keypoints for the face. I'm wondering which pose you think would be better for training.

BIGJUN777 commented 5 months ago

  1. 8 GPUs, with a total batch size of 64.
  2. Resize and crop; please refer to the paper or other open-source implementations (see the sketch after this list).
  3. Of course; dive into the code and get what you want. In general, accurate face keypoints can facilitate expression learning, but they can be harmful when they contain a lot of noise. I just aligned my implementation with the original paper, so you will need to figure that out yourself.
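
On point 2, what I mean by "resize and crop" is the common recipe of resizing the shorter side to the target resolution and then taking a center crop. Here is a minimal sketch with OpenCV; the 512x512 target and the center crop are assumptions on my side, and the exact preprocessing in the paper or in other repos may differ (e.g. cropping around the detected person instead).

```python
import cv2
import numpy as np

def resize_and_center_crop(frame: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize the shorter side to `size`, then center-crop a size x size square."""
    h, w = frame.shape[:2]
    scale = size / min(h, w)
    frame = cv2.resize(frame, (round(w * scale), round(h * scale)),
                       interpolation=cv2.INTER_AREA)
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]
```

Note that for portrait TikTok clips a plain center crop does cut off the top and bottom of the frame, which is why some implementations crop around a detected human bounding box instead.
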
YifuDeng commented 5 months ago

Thank you so much for sharing!

syorami commented 4 months ago

Hi @BIGJUN777, could you share how you process your dataset? Do you use extracted RGB images, or do you get the pose directly from the original videos?

BIGJUN777 commented 4 months ago

I do not exactly understand what you mean. I extracted the pose frame by frame.
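
To be concrete, "frame by frame" just means decoding the video and running the pose detector on every decoded frame independently. A minimal sketch with OpenCV; `detect_pose` is a hypothetical callable standing in for whatever DWPose wrapper you use, so its real import path and signature will differ.

```python
import cv2

def extract_pose_frames(video_path: str, detect_pose):
    """Decode a video and run a pose detector independently on every frame.

    detect_pose: hypothetical callable mapping a BGR frame to a rendered
    pose image or a keypoint array (e.g. a DWPose wrapper).
    """
    cap = cv2.VideoCapture(video_path)
    poses = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        poses.append(detect_pose(frame))
    cap.release()
    return poses
```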

xianrui-luo commented 3 months ago

Hi @BIGJUN777, would you mind sharing how you resize and crop the images? I have looked at other papers, but I am still confused: if the image is cropped (e.g. TikTok), doesn't the human body usually get cut off in the process?