Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs
https://synclabs.so

How do I increase the quality of the rendered video? #84

Closed: IronclawFTW closed this issue 3 years ago

IronclawFTW commented 3 years ago

Hi, my first time here. Didn't find anywhere else to post, so sorry if this is the wrong place.

I just discovered Wav2Lip and I love it. And I know the pre-learned model thingy we get to use was made on low resolution stuff, thus things will not look highres or sharp. I've created many lip sync tests with different resolution sources, and the mouth is always lowres and blurry in comparison. And yes, the "--resize_factor" does kinda make it look "better", as the lowres mouth/lips match better if the source video gets a lower res/quality.

Anyway, as I don't know how to train my own models (not found a single tutorial, which my dumb brain would need... so can't make a higher res model for myself), I would at least like to know what file to edit (as I assume that's how it would be done) to increase the output file's quality, as right now it's very low. Seems like 2-3 Mb/s. So, for example, if the source material is in 1080p at 60 Mb/s, the output file will look terrible in comparison, so blocky/blurry, loads of artifacts all over. Nope, I'm not using "--resize_factor" in this case as I want to keep the original video resolution.

Getting my RTX 3090 in about 2 weeks, which is of no use for me with lip syncing as I'm only using the online Colab Notebook thingy, as I don't know how to install it all on my own system. Will use it for some nice DeepFakes tho using DeepFakeLab, which was a breeze to install (pretty much just unzip one file and it all works).

So, anyway, I was thinking of trying to combine DeepFakes with lip syncing for some fun videos. Like Duke Nukem's face on some muscular dude doing funny stuff, while his lips are synced to Duke Nukem voice samples, etc.

prajwalkr commented 3 years ago

You will need to train on a good high-res dataset. See this for some tips. You need to change the model architecture to handle high-res inputs. You should appropriately add more layers and downsampling operations to get a 1-D embedding at the end of the encoders. More upsampling layers at the end of the decoder to increase the resolution. This is the simplest way of doing it, I guess. We have not tried it, so we cannot assure you that it is the way to go. But, enjoy experimenting! :100:
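
For illustration, here is a rough, untested sketch of what "one more downsampling block in the encoder, one more upsampling block in the decoder" could look like in plain PyTorch. The channel counts and kernel sizes below are placeholders, not the actual layers from models/wav2lip.py; the point is only that the face encoder needs an extra stride-2 stage to still bottleneck to a 1×1 map when the input grows from 96×96 to 192×192, and the decoder needs an extra transposed-conv stage to reach 192×192 again.

```python
# Untested sketch: extending encoder/decoder for 192x192 faces.
# Channel counts (512, 64) and kernel sizes are placeholders, not the
# real values from models/wav2lip.py.
import torch
import torch.nn as nn

# one extra stride-2 block for the face encoder (halves H and W)
extra_encoder_block = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)

# one extra transposed-conv block for the decoder (doubles H and W)
extra_decoder_block = nn.Sequential(
    nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# quick shape check on dummy tensors
print(extra_encoder_block(torch.randn(1, 512, 6, 6)).shape)    # -> (1, 512, 3, 3)
print(extra_decoder_block(torch.randn(1, 64, 96, 96)).shape)   # -> (1, 64, 192, 192)
```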

prajwalkr commented 3 years ago

the files in the models directory are the ones that require main changes.

IronclawFTW commented 3 years ago

> You will need to train on a good high-res dataset. See this for some tips. You need to change the model architecture to handle high-res inputs. You should appropriately add more layers and downsampling operations to get a 1-D embedding at the end of the encoders. More upsampling layers at the end of the decoder to increase the resolution. This is the simplest way of doing it, I guess. We have not tried it, so we cannot assure you that it is the way to go. But, enjoy experimenting! 💯

I've already been there, read it all, and like I said before, I can't find a proper tutorial showing how it's done anywhere on the internet. Dunno if what you linked is enough for most people, but to me it doesn't make sense, sorry. Also, like I said, I'm only using the Colab Notebook, not my own setup on my computer, as I don't know how to set it all up, and what's suggested in the link might require my own setup. Or maybe this thingy you suggested can be done in the Colab Notebook?

> Can you indicate specifically which files would require changes and then give a before-after example of one such change? That would be really helpful

I don't know which files need modifying, hence my question about which files. I'm talking about the output quality of the "result_voice.mp4"; I need that one to be in higher quality (higher bitrate etc) than it is now by default. When I find out, I will post it among these comments, so just check back here for any status :)

> the files in the models directory are the ones that require main changes.

I've read through the 4 files there and don't see any parameter/command that deals with the video output quality (bitrate, etc).

In the output window, where a bunch of stuff is printed while it's creating the output file, these lines control the quality/bitrate, I think. But where are these commands located, and what of all this needs to be changed for a higher quality output?

```
[libx264 @ 0x55909bc7cd00] -qscale is ignored, -crf is recommended.
[libx264 @ 0x55909bc7cd00] using SAR=1/1
[libx264 @ 0x55909bc7cd00] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 0x55909bc7cd00] profile High, level 4.0
[libx264 @ 0x55909bc7cd00] 264 - core 152 r2854 e9a5903 - H.264/MPEG-4 AVC codec - Copyleft 2003-2017 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=3 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
```

rebotnix commented 3 years ago

Hi, I do not think that changing the bitrate will give you better output quality. The faces are generated by a low-resolution model (96 px). First, we have to fix that to get better output quality. You can tweak the global output a little bit to avoid the compression artifacts, but you will not see an improvement in the faces.
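
For the compression side specifically: the crf=23.0 in the log above is just the libx264 default. Here is a rough, untested sketch of overriding it, assuming the final mux is a single ffmpeg call in the inference script (the paths and variable names below are placeholders, not the script's actual ones). A lower CRF means higher quality and a bigger file; around 18 usually looks visually transparent.

```python
# Sketch only: mux audio and the intermediate video with an explicit, lower CRF.
# All paths below are placeholders for whatever the script actually uses.
import subprocess

audio_path = 'sample_audio.wav'          # stands in for args.audio
temp_video = 'temp/result.avi'           # intermediate video the script writes
out_path   = 'results/result_voice.mp4'  # stands in for args.outfile

command = (f'ffmpeg -y -i {audio_path} -i {temp_video} '
           f'-c:v libx264 -crf 18 -preset slow -c:a aac {out_path}')
subprocess.call(command, shell=True)
```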

michaelbuddy commented 3 years ago

> the files in the models directory are the ones that require main changes.

Another question... Is there any way to speed up the inference? Can TensorRT help with this?

prajwalkr commented 3 years ago

I do not know about TensorRT but I can tell you that currently, the major bottleneck is the face detection part. If you have the faces detected, then the rest of the process is real-time.

Crazyjoedevola commented 3 years ago

Are you planning on releasing an HQ version soon?

rebotnix commented 3 years ago

> the files in the models directory are the ones that require main changes.

> Another question... Is there any way to speed up the inference? Can TensorRT help with this?

Yes, TensorRT will help to speed things up when we use int8 on the dedicated TensorRT cores. Another idea is to split and segment the different parts. I think we would have to write some custom CUDA TensorRT kernels for this model, but it's definitely possible.

As you can see in nvidia-smi, the cores are not fully used, so there is also a lot of potential for optimization.

Before all of that, we have to create a new HQ model.
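
A common first step toward TensorRT, sketched below and untested: export the generator to ONNX and let trtexec build an engine from it. The input shapes (an 80×16 mel window and a 6-channel 96×96 face crop), the checkpoint path, and the state-dict handling are assumptions based on the paper and the inference script, so they may need adjusting; int8 additionally needs a calibration dataset.

```python
# Untested sketch: export the Wav2Lip generator to ONNX for TensorRT.
# Shapes and checkpoint path are assumptions and may need adjusting.
import torch
from models import Wav2Lip  # the repo's model class

model = Wav2Lip()
checkpoint = torch.load('checkpoints/wav2lip_gan.pth', map_location='cpu')
state = checkpoint['state_dict']
state = {k.replace('module.', ''): v for k, v in state.items()}  # strip DataParallel prefix if present
model.load_state_dict(state)
model.eval()

mel = torch.randn(1, 1, 80, 16)    # assumed audio-encoder input: one mel window
faces = torch.randn(1, 6, 96, 96)  # assumed face-encoder input: masked + reference crop

torch.onnx.export(model, (mel, faces), 'wav2lip.onnx', opset_version=11,
                  input_names=['mel', 'faces'], output_names=['gen_frame'])

# Then, outside Python, something like:
#   trtexec --onnx=wav2lip.onnx --fp16 --saveEngine=wav2lip.engine
```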

IronclawFTW commented 3 years ago

> Hi, I do not think that changing the bitrate will give you better output quality. The faces are generated by a low-resolution model (96 px). First, we have to fix that to get better output quality. You can tweak the global output a little bit to avoid the compression artifacts, but you will not see an improvement in the faces.

I know, I just want the video I'm using to come out with the same quality; I know I can't do much about the lips. This was mentioned in my post: I was already aware of the low-res lips no matter what source quality and output quality I use. I just want to know how to increase the quality of the output video. Like, if the source video I use is sharp, at like 60 Mb/s, and you can even see each snowflake falling down, then after it's been exported by the script it's at like 3 Mb/s and no snowflakes can be seen, because now it's full of artifacts, blocky and blurry, because of the terrible compression that is set somewhere.

rebotnix commented 3 years ago

I see now. I will try to test it with RAW export and maybe better compression settings; we can pipe RAW frames to FFmpeg and then set the settings for the used codec manually.
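
A sketch of that "pipe RAW frames to FFmpeg" idea (untested; resolution, fps, and paths are placeholders, and the dummy frames stand in for whatever produces the final composited frames): instead of writing an intermediate .avi with cv2.VideoWriter, raw BGR frames go straight to ffmpeg's stdin, so the codec and CRF are fully under your control.

```python
# Untested sketch: pipe raw BGR frames to ffmpeg instead of cv2.VideoWriter.
# Width/height/fps/paths are placeholders; the dummy frames stand in for the
# real composited output frames.
import subprocess
import numpy as np

fps, width, height = 25, 1920, 1080
out_path = 'results/result_hq.mp4'

ffmpeg = subprocess.Popen([
    'ffmpeg', '-y',
    '-f', 'rawvideo', '-pix_fmt', 'bgr24',
    '-s', f'{width}x{height}', '-r', str(fps),
    '-i', '-',                                   # raw frames arrive on stdin
    '-c:v', 'libx264', '-crf', '16', '-preset', 'slow', '-pix_fmt', 'yuv420p',
    out_path,
], stdin=subprocess.PIPE)

# dummy frames: two seconds of black video, just to show the mechanics
for _ in range(fps * 2):
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    ffmpeg.stdin.write(frame.tobytes())

ffmpeg.stdin.close()
ffmpeg.wait()
```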

dipam7 commented 3 years ago

> I do not know about TensorRT but I can tell you that currently, the major bottleneck is the face detection part. If you have the faces detected, then the rest of the process is real-time.

If I am not wrong, every time we do inference, we detect faces in the video. Suppose I want to use the same video every time for inference, with the same duration of audio (1 min audios only for example), then can I save the detected faces and reuse them to remove the bottleneck?

prajwalkr commented 3 years ago

> then can I save the detected faces and reuse them to remove the bottleneck?

Yes, you can change the code to do that.
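
A minimal sketch of that caching idea (untested; `detect_faces` below is a hypothetical stand-in for the repo's face-detection step, not its real API): run detection once per video, save the per-frame boxes to disk, and reload them on later runs that reuse the same video.

```python
# Untested sketch: cache per-frame face boxes so repeated runs on the same
# video skip the expensive detection pass. `detect_faces` is hypothetical.
import os
import pickle

def get_face_boxes(video_id, frames, cache_dir='face_cache'):
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f'{video_id}.pkl')

    if os.path.exists(cache_path):        # reuse previously saved detections
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    boxes = detect_faces(frames)          # hypothetical expensive detector pass
    with open(cache_path, 'wb') as f:     # save for the next run
        pickle.dump(boxes, f)
    return boxes
```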