A PyTorch implementation of "Everybody Dance Now" from the Berkeley AI lab, including all functionality except pose normalization.
Other implementations:
yanx27: EverybodyDanceNow reproduced in pytorch
nyoki-pytorch: pytorch-EverybodyDanceNow
Also check out densebody_pytorch for 3D human mesh estimation from monocular images.
For other necessary packages, use pip install -r requirements for a quick install.
./pose_estimator/compute_coordinates_for_video.py. However, you won't be able to use TensorBoard this way.
To reproduce our results, download the Afrobeat workout sequence from YouTube (clipconverter is a great downloading tool).
Open the mp4 file with imageio, remove the first 25 seconds (625 frames for our video), and resize the remaining frames to 288*512 (for 16:9 HD video). Save all the frames in a single folder named train_B. It is highly recommended that the frames be named with their index numbers, like 00001.png, 00002.png.
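The frame-extraction step above can be sketched with imageio and Pillow. This is a minimal illustration, not the authors' preprocessing script. It assumes the imageio-ffmpeg backend is installed for reading mp4 files and that 288*512 means 288 rows by 512 columns; the output path ./datasets/my_dance/train_B and both function names are hypothetical. At 25 fps, skipping 625 frames drops the first 25 seconds.

```python
import os

import numpy as np
from PIL import Image

SKIP_FRAMES = 625          # first 25 s of our video (25 fps)
TARGET_SIZE = (512, 288)   # PIL expects (width, height): 288 rows x 512 columns


def save_frames(frames, out_dir, skip=SKIP_FRAMES, size=TARGET_SIZE):
    """Drop the first `skip` frames, resize the rest, save them as 00001.png, 00002.png, ..."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for i, frame in enumerate(frames):
        if i < skip:
            continue
        count += 1
        img = Image.fromarray(np.asarray(frame, dtype=np.uint8)).resize(size)
        img.save(os.path.join(out_dir, f"{count:05d}.png"))
    return count


def extract_train_b(video_path, out_dir="./datasets/my_dance/train_B"):
    """Hypothetical helper: read an mp4 with imageio and write train_B frames."""
    import imageio  # reading mp4 requires the imageio-ffmpeg backend

    reader = imageio.get_reader(video_path)
    try:
        return save_frames(reader, out_dir)
    finally:
        reader.close()
```

Keeping the resize-and-save logic in save_frames, separate from video decoding, makes it easy to reuse with frames coming from any other reader.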
Download a pre-trained pose estimator from Yandex Disk and put it under the subfolder pose_estimator, then run the following script to estimate the pose for each frame and render the poses as RGB images.
python ./pose_estimator/compute_coordinates.py
The script should generate a folder named train_A containing the corresponding pose stick-figure images (also named 00001.png, 00002.png, etc.), and a numpy file poses.npy containing the estimated poses with shape N*18*2, where N is the number of frames.
The numpy file is not needed for training the global generator, but it is required for training the face-enhancer, which estimates and crops the head region from synthesized frames.
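As a rough illustration of how poses.npy can drive the head crop, the sketch below centres a square box on one keypoint of a frame's (18, 2) pose. It is a hypothetical helper, not the face-enhancer's code: it assumes keypoint 0 is the nose (as in the 18-keypoint COCO/OpenPose ordering), that columns are (x, y), and it uses a crop size of 96, the face-enhancer default stated in this README. The box is shifted rather than shrunk near the image border so every crop has the same size.

```python
import numpy as np

NOSE = 0   # assumed: index 0 is the nose in the 18-keypoint ordering
CROP = 96  # face-enhancer default crop size mentioned in this README


def head_box(pose, crop=CROP, img_h=288, img_w=512):
    """Return (top, left, bottom, right) of a crop x crop box centred on the nose.

    `pose` is one row of poses.npy with shape (18, 2); (x, y) column order is assumed.
    """
    x, y = pose[NOSE]
    left = int(round(x)) - crop // 2
    top = int(round(y)) - crop // 2
    # Shift the box back inside the image instead of shrinking it.
    left = min(max(left, 0), img_w - crop)
    top = min(max(top, 0), img_h - crop)
    return top, left, top + crop, left + crop


def crop_head(frame, pose, crop=CROP):
    """Cut the head region out of an (H, W, C) frame."""
    h, w = frame.shape[:2]
    t, l, b, r = head_box(pose, crop, h, w)
    return frame[t:b, l:r]
```

For example, a nose at (256, 144) in a 288*512 frame yields the box (96, 208, 192, 304).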
Note: You can also use OpenPose or any other pose-estimation network for this step. Just make sure you organize your pose data as described above.
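If you go the OpenPose route, its per-frame JSON output can be stacked into the N*18*2 layout roughly as follows. This is a sketch under assumptions: OpenPose must be run with the COCO body model (--model_pose COCO) so that each person has 18 keypoints stored as flat (x, y, confidence) triples in pose_keypoints_2d, only the first detected person is kept, and it is assumed (not verified) that the COCO joint order matches what this repository's pose estimator produces. The function name is hypothetical.

```python
import json

import numpy as np


def openpose_json_to_array(json_paths, n_joints=18):
    """Stack the first detected person of each OpenPose JSON file into an (N, n_joints, 2) array."""
    poses = np.zeros((len(json_paths), n_joints, 2), dtype=np.float32)
    for i, path in enumerate(json_paths):
        with open(path) as f:
            people = json.load(f).get("people", [])
        if not people:
            continue  # no detection in this frame: leave its pose as zeros
        kp = np.asarray(people[0]["pose_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
        if kp.shape[0] != n_joints:
            raise ValueError(f"{path}: expected {n_joints} keypoints, got {kp.shape[0]}")
        poses[i] = kp[:, :2]  # drop the confidence column, keep (x, y)
    return poses
```

Pass the JSON files in frame order (for example sorted by name) and save the result with np.save("poses.npy", ...).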
Wrap train_A, train_B and poses.npy into the same folder and put it under ./datasets/.
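Before training, a quick sanity check of the assembled folder can save a failed run. The sketch below is a hypothetical helper, not part of the repository: it assumes PNG frames, that train_A and train_B must contain exactly the same file names, and that poses.npy has one (18, 2) row per frame, as described above.

```python
import os

import numpy as np


def check_dataset(root):
    """Sanity-check a ./datasets/<name>/ folder holding train_A, train_B and poses.npy."""
    names_a = sorted(f for f in os.listdir(os.path.join(root, "train_A")) if f.endswith(".png"))
    names_b = sorted(f for f in os.listdir(os.path.join(root, "train_B")) if f.endswith(".png"))
    if names_a != names_b:
        raise ValueError("train_A and train_B must contain the same file names")
    poses = np.load(os.path.join(root, "poses.npy"))
    expected = (len(names_a), 18, 2)
    if poses.shape != expected:
        raise ValueError(f"poses.npy has shape {poses.shape}, expected {expected}")
    return len(names_a)  # number of paired frames
```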
The model has not been fully trained/tested on other dance videos. You are encouraged to try your own dataset as well, but performance is not guaranteed.
Empirically, to increase the chance of success in training/testing, it is important that your video:
On the contrary, your training is likely to fail if your video contains:
If you encounter any failure case, do not hesitate to open an issue to let us know!
Download pretrained checkpoints:
./checkpoints/everybody_dance_now_temporal/
./face-enhancer/checkpoints/dance_test_new_down2_res6/
./face-enhancer/utils/
Prepare the testing sequence: save the skeleton figures in a folder named test_A, slice the corresponding pose coordinates from the previously cached poses.npy, wrap them in a single folder (for example cardio_dance_test), and put it under ./datasets/.
In addition, the program supports using the first ground-truth frame as a reference, so create a new folder test_B and put inside it the ground-truth frame corresponding to the first item in test_A (with an identical file name, of course).
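The test-folder preparation above can be sketched as follows for the common case where the test clip is a contiguous range of an already processed dataset. This is a hypothetical helper under stated assumptions: frames follow the 00001.png naming scheme, the original file names are kept, and the sliced poses are saved as poses.npy inside the test folder (the README only says to slice them from the cached file).

```python
import os
import shutil

import numpy as np


def make_test_set(src_root, dst_root, start, end):
    """Build test_A, test_B and poses.npy from frames [start, end) of a processed dataset.

    Frame i (0-based) is assumed to be named f"{i + 1:05d}.png".
    """
    names = [f"{i + 1:05d}.png" for i in range(start, end)]
    test_a = os.path.join(dst_root, "test_A")
    test_b = os.path.join(dst_root, "test_B")
    os.makedirs(test_a, exist_ok=True)
    os.makedirs(test_b, exist_ok=True)
    # Stick figures for every test frame.
    for name in names:
        shutil.copy(os.path.join(src_root, "train_A", name), os.path.join(test_a, name))
    # Only the ground-truth frame matching the first test_A item, same file name.
    shutil.copy(os.path.join(src_root, "train_B", names[0]), os.path.join(test_b, names[0]))
    # Pose rows for the same frame range.
    poses = np.load(os.path.join(src_root, "poses.npy"))
    np.save(os.path.join(dst_root, "poses.npy"), poses[start:end])
    return names
```

For example, make_test_set("./datasets/my_dance", "./datasets/cardio_dance_test", 3, 7) would copy frames 00004.png to 00007.png (my_dance is a placeholder name).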
Run the following command for global synthesis:
sh ./scripts/test_full_512.sh
This generates a coarse video stored in ./results/$NAME$/$WHICH_EPOCH$/test_clip.avi and caches all synthesized frames for face_enhancer evaluation.
Run the face-enhancer to get the final result.
python ./face-enhancer/enhance.py
Prepare the dataset following the instructions above.
For the pose2vid baseline, run the script:
sh ./scripts/train_full_512.sh
If you wish to incorporate the optical flow loss, run the script:
sh ./scripts/train_flow_512.sh
Warning: this module increases memory cost and slows down training by 40% to 50%. It is also very sensitive to background flow, so use it at your discretion. However, if you can accurately estimate the dancer's body mask, a masked flow loss could help with temporal smoothing. Please send a PR if you find the masked flow loss effective.
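To make the masked-flow idea concrete, here is a small NumPy illustration of a masked temporal-consistency loss. It is not the repository's flow-loss module and uses no PyTorch. The backward-flow convention (flow[y, x] = (dx, dy) points from a pixel in the current frame to its location in the previous frame), the nearest-neighbour warping, and the binary body mask are all assumptions. Pixels outside the mask, such as the background whose flow the warning above describes as unreliable, contribute nothing to the loss.

```python
import numpy as np


def warp_backward(prev, flow):
    """Nearest-neighbour backward warp: out[y, x] = prev[y + dy, x + dx], clamped to the image."""
    h, w = prev.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev[src_y, src_x]


def masked_flow_loss(cur, prev, flow, mask):
    """Mean absolute difference between `cur` and the flow-warped `prev`, inside `mask` only."""
    warped = warp_backward(prev, flow)
    diff = np.abs(cur.astype(np.float64) - warped.astype(np.float64))
    if diff.ndim == 3:  # average over colour channels
        diff = diff.mean(axis=-1)
    weight = mask.astype(np.float64)
    total = weight.sum()
    if total == 0:
        return 0.0  # empty mask: nothing to penalise
    return float((diff * weight).sum() / total)
```

In a real training loop the same computation would run on generated frames with differentiable (bilinear) warping, so gradients reach the generator; the nearest-neighbour version only shows what the loss measures.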
Rename your train_B folder to test_real (or save a copy and rename it).
Test the global pose2vid network (either trained in Step I or initialized with the downloaded pretrained model) on your train_A dataset, and save the results in a folder named test_sync with matching file names.
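Since test_real and test_sync must hold frames with matching names, a quick check like the one below can catch mismatches before face-enhancer training. The function name paired_names and the assumption that both folders sit side by side under one dataset root are hypothetical.

```python
import os


def paired_names(root):
    """Return the file names shared by test_real and test_sync; raise if any frame is unmatched."""
    real = set(os.listdir(os.path.join(root, "test_real")))
    sync = set(os.listdir(os.path.join(root, "test_sync")))
    unmatched = sorted(real ^ sync)  # names present in only one of the two folders
    if unmatched:
        raise ValueError(f"{len(unmatched)} unmatched frames, e.g. {unmatched[:5]}")
    return sorted(real)
```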
Open the face-enhancement training script at ./face_enhancement/main.py, modify the dataset_dir, pose_dir, checkpoint dir and log_dir variables, and run the script.
The default network structure consists of 2 downsampling layers, 6 residual blocks, and 2 upsampling layers. You can modify it for the best enhancement effect by changing the corresponding parameters at line 22. The crop size is also adjustable at line 23 (default: 96).
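The default face-enhancer layout (2 downsampling layers, 6 residual blocks, 2 upsampling layers) can be pictured with the PyTorch sketch below. It is not the module defined in main.py: the channel widths, instance normalization, 7x7 input/output convolutions, and the choice to predict a residual that is added back to the input crop are assumptions. The residual formulation follows the face GAN described in the Everybody Dance Now paper, but this repository may differ, and for brevity the sketch takes only the synthesized face crop as input. With a 96-pixel crop, the two stride-2 layers give a 24x24 bottleneck.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""

    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)


class FaceEnhancerSketch(nn.Module):
    """Hypothetical down2-res6 generator that refines a face crop by predicting a residual."""

    def __init__(self, in_ch=3, base=64, n_down=2, n_res=6):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 7, padding=3), nn.InstanceNorm2d(base), nn.ReLU(inplace=True)]
        ch = base
        for _ in range(n_down):  # each stride-2 conv halves H and W, doubles channels
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True)]
            ch *= 2
        layers += [ResBlock(ch) for _ in range(n_res)]
        for _ in range(n_down):  # each transposed conv doubles H and W, halves channels
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)]
            ch //= 2
        layers += [nn.Conv2d(ch, in_ch, 7, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.net(x)  # enhanced crop = input crop + predicted residual
```

Changing n_down or n_res mirrors the kind of adjustment the README suggests making at line 22; with n_down=2 the crop size should stay divisible by 4 so the output matches the input shape.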
Should you find this implementation useful, please cite the following in your paper or open-source project:
@article{chan2018everybody,
title={Everybody dance now},
author={Chan, Caroline and Ginosar, Shiry and Zhou, Tinghui and Efros, Alexei A},
journal={arXiv preprint arXiv:1808.07371},
year={2018}
}
This repo borrows heavily from pix2pixHD.