AudioLDM

Get GPT to write a text-to-audio prompt for background sounds/music that fit the response, along with a text-to-image prompt
AudioLDM generates the background sounds from that prompt
Mix the generated background sounds in while the corresponding response plays
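
A minimal sketch of the last step, assuming the speech and the AudioLDM output have already been decoded to float samples in [-1.0, 1.0] at the same sample rate. The function name `mix_background` and the `gain` default are illustrative, not part of AudioLDM or any library.

```python
def mix_background(speech, background, start, gain=0.2):
    """Mix `background` under `speech`, beginning at sample index `start`.

    Both inputs are sequences of float samples in [-1.0, 1.0] that share
    one sample rate (an assumption of this sketch). The result has the
    same length as `speech`; background that runs past the end is dropped.
    """
    if start < 0:
        raise ValueError("start must be non-negative")
    mixed = list(speech)
    for offset, sample in enumerate(background):
        position = start + offset
        if position >= len(mixed):
            break  # background is longer than the remaining speech
        value = mixed[position] + gain * sample
        # Hard-clip so the sum stays inside the valid sample range.
        mixed[position] = max(-1.0, min(1.0, value))
    return mixed
```

A low `gain` keeps the background under the voice. A real pipeline would likely add fades and resampling, which this sketch leaves out.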

wav2lip

Input is the response audio and the generated images
Sync lip movements to the spoken words with wav2lip
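
One possible way to call it from Python, assuming the Wav2Lip repository is cloned locally and a pretrained checkpoint has been downloaded. The flags follow the `inference.py` usage shown in the Wav2Lip README; every path below is a placeholder, and the helper names are illustrative.

```python
import subprocess


def build_wav2lip_command(checkpoint_path, face_path, audio_path, outfile):
    """Build the argument list for Wav2Lip's inference.py script."""
    return [
        "python", "inference.py",
        "--checkpoint_path", checkpoint_path,  # pretrained weights
        "--face", face_path,                   # image or video of the face
        "--audio", audio_path,                 # speech to lip-sync to
        "--outfile", outfile,                  # where the result is written
    ]


def run_wav2lip(repo_dir, checkpoint_path, face_path, audio_path, outfile):
    """Run inference from inside the cloned repo (repo_dir is a placeholder)."""
    command = build_wav2lip_command(checkpoint_path, face_path, audio_path, outfile)
    subprocess.run(command, cwd=repo_dir, check=True)
```

Keeping the command builder separate from the runner makes it easy to log or inspect the exact invocation before spending GPU time on it.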

Video Driven Portrait Animation

Get GPT to give mood/style/emotion tags for the response
Options
- Find matching videos from a large video corpus (we can do the tagging ourselves using something like this)
- Find a large corpus that's already tagged
- (6 months from now) use a text-to-video model to generate the driving videos
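
For the corpus-matching options, a rough sketch of picking a driving video by tag overlap. It assumes the corpus is a plain mapping from video path to tags; the function names and the choice of Jaccard similarity are illustrative, not something the plan above specifies.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two tag collections (0.0 to 1.0)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_driving_video(response_tags, corpus):
    """Return the video whose tags overlap most with the response tags.

    `corpus` maps a video path to its tags. Returns None when the corpus
    is empty or no video shares any tag with the response.
    """
    best_path, best_score = None, 0.0
    for path, tags in corpus.items():
        score = jaccard(response_tags, tags)
        if score > best_score:
            best_path, best_score = path, score
    return best_path
```

For example, with a hypothetical corpus `{"calm.mp4": ["calm", "serious"], "happy.mp4": ["happy", "energetic"]}`, the tags `["happy", "playful"]` select `happy.mp4`.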

dtedesco1 commented 1 year ago

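More thoughts from Leo:

"Besides training with the videos we provide, you can also record your own video and train a one-of-a-kind GeneFace virtual human model for yourself!" GeneFace cannot be used in a one-shot fashion; it has to be paired with other models. If we want audio to drive Lincoln speaking, we first have to train GeneFace to get a Lincoln GeneFace model, and training requires at least one video of Lincoln. That video, however, can be produced from a single photo using a video-driven model.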

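Once the model is trained, inference may be fairly resource-intensive: "The inference process of the NeRF-based image renderer is relatively slow (rendering 250 frames at 512x512 resolution on an RTX2080Ti takes about 2 hours)."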
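
To put the quoted benchmark in perspective, a back-of-the-envelope estimate. It assumes render time scales linearly with frame count and uses 25 fps as the output frame rate; both are assumptions of this sketch, not figures from the quote.

```python
# Figures from the quoted benchmark: 250 frames in about 2 hours.
BENCH_FRAMES = 250
BENCH_SECONDS = 2 * 60 * 60


def estimated_render_hours(clip_seconds, fps=25):
    """Estimate render time in hours for a clip of `clip_seconds`.

    Assumes linear scaling from the benchmark (about 28.8 s per frame);
    `fps` is an assumed output frame rate.
    """
    frames = clip_seconds * fps
    return frames * BENCH_SECONDS / BENCH_FRAMES / 3600
```

Under these assumptions the benchmark's 250 frames correspond to a 10-second clip, and a one-minute clip would take roughly 12 hours on the same card.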

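This project fills in the last piece of the digital-human puzzle, which is awesome. The rough process is as follows:

1. Use SD to generate a single photo
2. Use a video-driven model to turn that single photo into a video of the person
3. Use the generated video as the training sample for a GeneFace virtual human model
4. Generate the script with GPT, then produce audio from the script
5. Use the audio to drive the GeneFace model to speak and perform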
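
The process above can be sketched as an ordered set of stages that pass artifacts along. Every stage body here is a placeholder that only returns a made-up file name; the stage names, artifact keys, and `run_pipeline` helper are hypothetical and stand in for real calls to SD, a video-driven animation model, GeneFace, GPT, and a text-to-speech system.

```python
def run_pipeline(stages, artifacts):
    """Run stages in order, checking each stage's inputs exist first.

    `stages` is a list of (name, required_keys, function) tuples. Each
    function receives the artifacts so far and returns new entries.
    """
    artifacts = dict(artifacts)
    for name, requires, stage in stages:
        missing = [key for key in requires if key not in artifacts]
        if missing:
            raise KeyError(f"stage {name!r} is missing inputs: {missing}")
        artifacts.update(stage(artifacts))
    return artifacts


# Placeholder stages: each returns a hypothetical output path or string.
STAGES = [
    ("portrait", ["image_prompt"], lambda a: {"portrait": "portrait.png"}),
    ("seed_video", ["portrait"], lambda a: {"seed_video": "seed.mp4"}),
    ("train_avatar", ["seed_video"], lambda a: {"avatar_model": "avatar.ckpt"}),
    ("script", ["topic"], lambda a: {"script": "Script about " + a["topic"]}),
    ("speech", ["script"], lambda a: {"speech": "speech.wav"}),
    ("performance", ["avatar_model", "speech"], lambda a: {"video": "final.mp4"}),
]
```

Calling `run_pipeline(STAGES, {"image_prompt": "...", "topic": "..."})` walks the stages in the order listed above. Declaring each stage's inputs makes a missing dependency fail early instead of partway through a long training or rendering run.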

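We could even use https://www.resemble.ai/ (Your Complete Generative Voice AI Toolkit) for voice cloning, which would make the result more realistic. Taking Obama as an example, we would only need one photo and one audio clip of him to get an Obama digital human.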

dtedesco1 / autovideos

V2 with Animations #2
