TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
Other
1.85k stars 224 forks

Single-step generation #55

Closed chunyu-li closed 2 months ago

chunyu-li commented 2 months ago
for i, (whisper_batch, latent_batch) in enumerate(
    tqdm(gen, total=int(np.ceil(float(video_num) / batch_size)))
):
    audio_feature_batch = torch.from_numpy(whisper_batch)
    audio_feature_batch = audio_feature_batch.to(
        device=unet.device, dtype=unet.model.dtype
    )  # torch tensor, shape (B, 5*N, 384)
    audio_feature_batch = pe(audio_feature_batch)
    latent_batch = latent_batch.to(dtype=unet.model.dtype)

    pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
    recon = vae.decode_latents(pred_latents)
    for res_frame in recon:
        res_frame_list.append(res_frame)

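From the code, it looks like the UNet runs only one forward pass to generate each frame. Is my understanding correct? If so, does this still count as a diffusion model?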

itechmusic commented 2 months ago

Hello, as we wrote in the README, although MuseTalk uses the SD 1.5 model architecture (and VAE weights), it is not a diffusion model. https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#model

chunyu-li commented 2 months ago

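> Hello, as we wrote in the README, although MuseTalk uses the SD 1.5 model architecture (and VAE weights), it is not a diffusion model. https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#model

By the way, did you use the SD 1.5 model weights, or did you train the UNet from scratch?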

jinqinn commented 2 months ago

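> Hello, as we wrote in the README, although MuseTalk uses the SD 1.5 model architecture (and VAE weights), it is not a diffusion model. https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#model

So inference uses the VAE + UNet models and does not diffuse from a random seed, which is why it is not called a diffusion model, right?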

chunyu-li commented 2 months ago

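> > Hello, as we wrote in the README, although MuseTalk uses the SD 1.5 model architecture (and VAE weights), it is not a diffusion model. https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#model
>
> So inference uses the VAE + UNet models and does not diffuse from a random seed, which is why it is not called a diffusion model, right?

The core of a diffusion model is multi-step generation; it has nothing to do with whether a random seed is used. The generation processes of GANs and VAEs also involve a random seed.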
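To make the single-step versus multi-step distinction concrete, here is a minimal toy sketch. `ToyDenoiser`, `single_step`, and `multi_step` are hypothetical names invented for illustration and are not part of MuseTalk; the "model" is a trivial update rule, not a real UNet.

```python
import random


class ToyDenoiser:
    """Hypothetical stand-in for a UNet; counts its forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t, cond):
        self.calls += 1
        # Toy update rule: move each element halfway toward the conditioning.
        return [xi + 0.5 * (ci - xi) for xi, ci in zip(x, cond)]


def single_step(model, latent, cond, t=0):
    # One forward pass; the output is used directly as the result.
    return model(latent, t, cond)


def multi_step(model, cond, num_steps=20, seed=0):
    # Diffusion-style sampling: start from seeded noise, refine repeatedly.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(num_steps)):
        x = model(x, t, cond)
    return x


cond = [1.0, -2.0, 0.5]

one = ToyDenoiser()
out_one = single_step(one, [0.0, 0.0, 0.0], cond)

many = ToyDenoiser()
out_many = multi_step(many, cond)

print(one.calls, many.calls)  # 1 20
```

Both paths could be seeded, but only the second calls the model repeatedly, which is the property being pointed to here.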

itechmusic commented 2 months ago

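> > Hello, as we wrote in the README, although MuseTalk uses the SD 1.5 model architecture (and VAE weights), it is not a diffusion model. https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#model
>
> By the way, did you use the SD 1.5 model weights, or did you train the UNet from scratch?

The UNet weights are trained from scratch, mainly because the output of SD's UNet is noise rather than a meaningful latent.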
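As a rough illustration of why a noise-predicting UNet is a poor starting point for direct latent prediction, the sketch below uses the standard DDPM forward-noising relation. The latent values and `alpha_bar` are made-up toy numbers, and the "perfect" predictions are assumptions for the sake of the example, not MuseTalk code.

```python
import math
import random

rng = random.Random(0)

x0 = [0.8, -0.3, 1.2]                    # clean latent (toy values)
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
alpha_bar = 0.6                          # hypothetical cumulative schedule value

# DDPM-style forward noising: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps
x_t = [
    math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
    for a, e in zip(x0, eps)
]

# An ideal noise-prediction network returns eps, which cannot be decoded as is.
pred_noise = eps

# Getting a latent out of it requires an extra inversion step.
x0_from_noise = [
    (xt - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
    for xt, e in zip(x_t, pred_noise)
]

# An ideal latent-prediction network returns x0 directly.
pred_latent = x0

print(max(abs(a - b) for a, b in zip(x0_from_noise, pred_latent)))
```

The two outputs live in different spaces, so weights trained for one target would not transfer cleanly to the other without the inversion.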