Open BIGJUN777 opened 7 months ago
Hey BigJun, do you have twitter? I'd like to chat about training
@jsonkcli Zhijun Liang. I rarely use Twitter, but I will keep an eye out for your message.
I couldn't find your twitter, do you have WeChat or Discord?
Please check my twitter again, or give me your twitter and I will drop you a message.
@ov47i
Hi there. When I trained on the TikTok dataset, which has 340 videos, the model would oddly generate cotton-like artifacts around the person. Have you encountered this issue in your training? One possible reason, I think, is that it is hard for the model to learn the movement of long hair from such limited data. Do you have any ideas? Thanks.
Hello, could you tell me which pose detector you used? The pose detection algorithm I tried predicts a shaky pose sequence, which causes worse output.
DWPose. I think it is normal to get a jittery pose sequence because we detect the pose frame by frame, which lacks temporal consistency.
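Since detection runs independently on each frame, a simple temporal filter can reduce the jitter. Below is a minimal sketch of my own (a moving average, not part of DWPose), assuming the per-frame keypoints have been stacked into a `(T, K, 2)` array of `K` keypoints over `T` frames:

```python
import numpy as np

def smooth_keypoints(kps: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing of a keypoint sequence.

    kps: (T, K, 2) array of K (x, y) keypoints over T frames.
    window: odd window size; larger values smooth more but lag fast motion.
    """
    pad = window // 2
    # edge-pad in time so the output keeps the same number of frames
    padded = np.pad(kps, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # convolve each (keypoint, coordinate) trajectory along the time axis
    out = np.stack([
        np.stack([np.convolve(padded[:, k, c], kernel, mode="valid")
                  for c in range(kps.shape[2])], axis=-1)
        for k in range(kps.shape[1])
    ], axis=1)
    return out
```

Note that this naive version smooths through missed detections too; in practice you would want to mask out low-confidence keypoints before averaging.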
Hey BigJun,
I was wondering whether you've been training your models using NVIDIA A100 GPUs, and could you kindly provide some details about the training duration in relation to the specific batch size and resolution settings employed? Specifically, I would appreciate it if you could share how long each phase—like the first stage and the second stage of training—took with their corresponding batch sizes and resolutions.
This information would be immensely helpful for me to make a preliminary estimation of the associated training costs. Thank you so much~!
Referring to Open-AnimateAnyone, I rewrote the code according to my understanding. I conducted my experiments on NVIDIA A100 80GB GPUs. On the TikTok dataset at 512x512 resolution, I used a batch size of 64 in stage 1 and 4 in stage 2. It took almost a day to train 50k steps in stage 1 and about 8 hours to train 11k steps in stage 2. I did not do much training-speed optimization. By the way, this task is data-hungry, and I could not get results as good as the author's on such a small dataset. Maybe I missed something. Give it a try yourself; I'm looking forward to your good news.
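A back-of-envelope cost estimate from these timings. The GPU count comes from the "8 GPUs and 64 bs in total" note later in the thread; the hourly rate is a placeholder assumption, so substitute your provider's A100 price:

```python
# Rough training-cost estimate from the reported timings.
num_gpus = 8                    # A100 80GB GPUs (8 GPUs, 64 bs total, per the thread)
stage_hours = {"stage1": 24.0,  # ~1 day for 50k steps at batch size 64
               "stage2": 8.0}   # ~8 h for 11k steps at batch size 4
price_per_gpu_hour = 2.0        # hypothetical on-demand A100 price in USD

gpu_hours = num_gpus * sum(stage_hours.values())
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:.0f} GPU-hours, ~${cost:.0f}")  # 256 GPU-hours, ~$512
```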
Your answer is very helpful to me. Thank you again!
Hey BIGJUN,
Thank you for sharing! It's really helpful. I have a few more questions that I hope you can answer.
- How many GPUs did you use? My training time is longer than yours.
- How did you convert the data into a resolution of 512x512? Did you directly resize it or did you crop each video?
- Can DWPose only extract keypoints for the body and hands? The pose sequence you provided is consistent with the original project, but Moore's pose sequence includes keypoints for the face. I'm wondering which pose you think would be better for training.
- 8 GPUs and 64 bs in total.
- Resize and crop. Please refer to the paper or other open-source implementations.
- Of course; dive into the code and extract what you want. In general, accurate face keypoints can facilitate expression learning, but they can be harmful when they contain much noise. I just aligned my implementation with the original paper, so you'll need to figure that out yourself.
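For the resize-and-crop step, here is the arithmetic I'd assume (short side scaled to 512, then a center crop); the exact recipe isn't specified in the thread, and real pipelines may crop around the detected person instead to avoid cutting off the body:

```python
def resize_and_crop_box(w: int, h: int, size: int = 512):
    """Compute the resized dimensions and center-crop box for a w x h frame.

    Scales the short side to `size`, then returns a (left, top, right, bottom)
    box usable with e.g. PIL's Image.crop to obtain a size x size crop.
    """
    scale = size / min(w, h)                        # short side -> size
    new_w, new_h = round(w * scale), round(h * scale)
    left, top = (new_w - size) // 2, (new_h - size) // 2
    return new_w, new_h, (left, top, left + size, top + size)

# e.g. a 1080x1920 portrait TikTok frame: short side 1080 is scaled to 512,
# then the 512x910 result is center-cropped vertically
print(resize_and_crop_box(1080, 1920))  # (512, 910, (0, 199, 512, 711))
```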
Thank you so much for sharing!
Hi @BIGJUN777, could you share how you processed your dataset? Do you use extracted RGB images, or do you get the pose directly from the original videos?
I'm not sure exactly what you mean. I extracted the pose frame by frame.
Hi @BIGJUN777, would you mind sharing how you resize and crop the images? I have seen other papers, but I am still confused: if the image is cropped (e.g. TikTok), doesn't part of the human body usually get cut off in the process?