YoungSeng / DiffuseStyleGesture

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models (IJCAI 2023) | The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 (ICMI 2023, Reproducibility Award)
MIT License

Transition between generated gesture #1

Open YoungSeng opened 1 year ago

YoungSeng commented 1 year ago

The segments we trained on are all 4 s long, and positional encoding alone makes it difficult to generalize to gestures of arbitrary length. MDM-based models that need time-awareness (arbitrarily long inference) therefore require smooth transitions between the generated sequences. The following approaches can serve as references:

  1. Our approach is to add seed poses for smooth transitions.
  2. PriorMDM, the follow-up work to MDM, uses DoubleTake for long motion generation.
  3. EDGE enforces temporal consistency between multiple sequences.
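
As a rough illustration of the seed-pose idea in item 1, the sketch below generates a long sequence chunk by chunk and passes the last `n_seed` frames of each chunk to the next call. It is a toy under stated assumptions, not the repository's code: `generate_chunk` is a hypothetical stand-in for the trained sampler (here a small random walk), and the shapes, the neutral starting pose, and the names `generate_long`, `n_frames`, and `n_joints` are chosen only for the example.

```python
import numpy as np


def generate_chunk(seed_poses, n_frames, rng):
    """Hypothetical stand-in for the trained sampler.

    Returns an array of shape (n_joints, n_frames) that continues from the
    last seed frame with a small random walk. A real model would instead
    denoise a clip conditioned on seed_poses (plus audio and style).
    """
    n_joints = seed_poses.shape[0]
    steps = rng.normal(scale=0.01, size=(n_joints, n_frames))
    return seed_poses[:, -1:] + np.cumsum(steps, axis=1)


def generate_long(n_chunks, n_frames, n_seed, n_joints, seed=0):
    """Chain fixed-length chunks, seeding each with the tail of the previous one."""
    rng = np.random.default_rng(seed)
    seed_poses = np.zeros((n_joints, n_seed))  # assumed neutral starting pose
    chunks = []
    for _ in range(n_chunks):
        chunk = generate_chunk(seed_poses, n_frames, rng)
        chunks.append(chunk)
        seed_poses = chunk[:, -n_seed:]  # tail of this chunk seeds the next
    return np.concatenate(chunks, axis=1)


motion = generate_long(n_chunks=3, n_frames=80, n_seed=30, n_joints=4)
print(motion.shape)  # (4, 240)
```

Because each chunk is conditioned on the tail of the previous one, the boundary between chunks stays close to continuous in this toy. How well that holds for a real diffusion sampler depends on how strongly the model follows its seed poses.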

sh-taheri commented 1 year ago

Hi, thanks for the great work!

Regarding your approach:

I am wondering if this is a bug in sample.py when smoothing the transitions here:

As you have commented yourself, the size of the variable last_poses is (1, model.njoints, 1, args.n_seed), so len(last_poses) is always 1. I think len(last_poses) should be replaced with np.size(last_poses, axis=-1), which is args.n_seed (30 frames by default). This way, it blends the first frames of the new prediction with the last frames of the previous prediction, something like this:

    for j in range(np.size(last_poses, axis=-1)):
        n = np.size(last_poses, axis=-1)
        prev = last_poses[..., j]
        next = sample[..., j]
        sample[..., j] = prev * (n - j) / (n + 1) + next * (j + 1) / (n + 1)

Am I right? Would appreciate your feedback. Thanks a lot
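
To make the proposal concrete, here is a minimal, self-contained sketch of the same linear cross-fade in vectorized form. It assumes the shape stated above, (1, njoints, 1, n_seed), with frames on the last axis; the function name `blend_transition` and the toy sizes are hypothetical and do not come from sample.py.

```python
import numpy as np


def blend_transition(last_poses, sample):
    """Cross-fade the first n frames of `sample` with `last_poses`.

    Both arrays keep frames on the last axis; n is the number of frames in
    `last_poses`. The two weights sum to 1 at every frame.
    """
    n = last_poses.shape[-1]
    j = np.arange(n)
    w_prev = (n - j) / (n + 1)  # weight of the previous clip, decreasing
    w_next = (j + 1) / (n + 1)  # weight of the new clip, increasing
    out = sample.copy()
    out[..., :n] = last_poses * w_prev + sample[..., :n] * w_next
    return out


n_joints, n_seed, n_frames = 3, 30, 80
last_poses = np.ones((1, n_joints, 1, n_seed))   # tail of the previous clip
sample = np.zeros((1, n_joints, 1, n_frames))    # newly generated clip

print(len(last_poses))       # 1 (size of the first axis only)
print(last_poses.shape[-1])  # 30 (number of seed frames)

blended = blend_transition(last_poses, sample)
```

With these toy inputs the blended frames fall from 30/31 at frame 0 to 1/31 at frame 29, and frames from 30 onward are left as generated. Looping over `len(last_poses)` instead would touch only frame 0.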

YoungSeng commented 1 year ago

Yes, when I reproduced it later I remembered that there was a minor problem in this part of the code, but it didn't seem to have much effect on the results. Also:

  1. The length of last_poses is not 1 but n_seed; the first 1 is the batch size and the second 1 is an extra dimension with no real meaning.
  2. The follow-up DiffuseStyleGesture+ definitely fixed this, see: here.