Markfryazino / wav2lip-hq

Extension of Wav2Lip repository for processing high-quality videos.

Fix colab pretraining notebook #22

Open sokffa opened 2 years ago

sokffa commented 2 years ago

The colab pretraining notebook is not updated. There are a lot of bugs in the code (related to paths and nonexistent files): https://colab.research.google.com/drive/1IUGYn-fMRbjH2IyYoAn5VKSzEkaXyP2s

youngt913 commented 2 years ago

CHECK THIS OUT https://www.youtube.com/watch?v=Kwhqj93wyXU

Twenkid commented 2 years ago

@sokffa Yes. I just tried to fix it and got as far as the last cell, which has some incompatibility (outdated or mismatched library versions or similar) that I couldn't resolve yet. I tried many things to install a proper version of basicsr, but there is a strange mismatch: basicsr/utils/options.py differs between the cloned repository and the installed package, with different parsing entry points, one exposes parse(...), the other parse_options. I couldn't fix it so far, either by installing basicsr with !python -m pip install basicsr,

or by cloning it from its repository and running "setup.py install", or even by a nasty copy into the Python installation location: cp /content/wav2lip-hq/basicsr/utils/options.py /usr/local/lib/python3.7/dist-packages/basicsr-1.4.0-py3.7.egg/basicsr/utils/options.py (though if the versions differ substantially it would probably fail somewhere else).

Another option is importing basicsr from the local folder instead of the system installation. That is nasty too, but I guess it may work if the paths and imports are adjusted properly.
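
A minimal, untested sketch of that last idea; /content/wav2lip-hq is just the clone location used elsewhere in this thread, and the final import only illustrates which copy of the module should win:

    import sys

    # Sketch (not from the notebook): make Python import the basicsr copy that
    # ships inside the cloned repo instead of the pip-installed one.
    REPO_ROOT = "/content/wav2lip-hq"  # assumed clone location from this thread
    if REPO_ROOT not in sys.path:
        sys.path.insert(0, REPO_ROOT)

    # "import basicsr" should now resolve to /content/wav2lip-hq/basicsr, whose
    # basicsr.utils.options is the variant that exposes parse(...).
    from basicsr.utils.options import parse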

The other fixes:

Add

!mkdir data

before:

!mkdir data/gt
!mkdir data/lq
!mkdir data/hq

Import cv2, ignore tqdm:

import os, cv2
paths = os.listdir("data/gt")

#for img_path in tqdm(paths): 
for img_path in paths:
    img = cv2.imread("data/gt/" + img_path)
    img = cv2.resize(img, (384, 384))
    cv2.imwrite("data/hq/" + img_path, img)
davidchateau commented 2 years ago

Hello, I tried to fix the notebook; here is what I did so far:

- duplicate the notebook ("file" -> "save a copy in drive")
- "runtime" -> "change runtime type" -> "GPU", or else I get an error about no GPU being available
- add "!mkdir data" before the other "mkdir"s
- downgrade torch/torchvision to avoid deprecation warnings: !pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
- install basicsr. I downloaded the code of every previous version of the library to find the ones where "parse" exists in "basicsr.utils.options" -> it is versions <= 1.3.3.10. I had errors installing versions < 1.3.3.4, so I went with 1.3.3.4: !pip3 install https://files.pythonhosted.org/packages/8c/ac/74f4e34fdbc7d3d9233a6f02a740ddb446d75551fbb6ed0c4243c4511a86/basicsr-1.3.3.4.tar.gz#sha256=b448cf9efa4ff2ca75109d3aac36ef50d6e08b0bcb310ebef57ed88c09a2d2ba
- create the log directory structure because I had errors about it: !mkdir /content/wav2lip-hq/experiments/ and !mkdir /content/wav2lip-hq/experiments/001_ESRGAN_x4_f64b23_custom16k_500k_B16G1_wandb/
- stop pretraining mode as mentioned in #17: !sed -i '/resume_state/d' /content/wav2lip-hq/train_basicsr.yml

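For convenience, those steps (after duplicating the notebook and switching the runtime to GPU) can be collected into a single Colab cell. This is only a sketch assembled from the commands above, not a cell from the original notebook:

    !pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
    !pip3 install https://files.pythonhosted.org/packages/8c/ac/74f4e34fdbc7d3d9233a6f02a740ddb446d75551fbb6ed0c4243c4511a86/basicsr-1.3.3.4.tar.gz#sha256=b448cf9efa4ff2ca75109d3aac36ef50d6e08b0bcb310ebef57ed88c09a2d2ba
    !mkdir data data/gt data/lq data/hq
    !mkdir -p /content/wav2lip-hq/experiments/001_ESRGAN_x4_f64b23_custom16k_500k_B16G1_wandb/
    !sed -i '/resume_state/d' /content/wav2lip-hq/train_basicsr.yml
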
Now I have the error "Input spatial size must be 128x128, but received torch.Size([4, 3, 384, 384])". I know that I could change the image resize to 128x128, but then I get another error.

Maybe @Markfryazino has an old working environment and can give us the details of the library versions and/or the proper pth files?

Note: I'm trying both to train the model for proper lipsync AND to use deepfacelab as mentioned above.

Thank you!

AIMads commented 2 years ago

Great work! I have identified the same issues, but I'm also stuck at the 128x128 error.

Twenkid commented 2 years ago

Good job! What is the other error you get after resizing? I think [4, 3, 384, 384] means batch size 4, 3 channels, and 384x384 spatial size. Shouldn't you resize your input to 384x384 rather than to 128? What size are your images? 128x128 doesn't sound very HQ to me; the normal wav2lip works at 96x96.
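
A tiny standalone PyTorch check of that reading of the shape (nothing from the notebook is assumed here):

    import torch

    # A batch of 4 RGB images of 384x384 pixels in NCHW layout:
    batch = torch.zeros(4, 3, 384, 384)
    print(batch.size())  # torch.Size([4, 3, 384, 384]) -> batch, channels, height, width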

AIMads commented 2 years ago

Yes, so my idea was to resize the LQ set to 128x128 and feed that into the model instead of the GT, as that made more sense, and then also upsample the HQ images to 512x512, which still keeps the required 4x difference. My problem is that the LQ images are still detected as having a size of 96x96 🙃
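
That pairing could be produced along the lines of the earlier resize snippet. A rough sketch only, assuming the data/lq and data/gt folders used above and that the config reads its targets from data/hq:

    import os, cv2

    # Sketch of the idea above: 128x128 low-quality inputs, 512x512 high-quality
    # targets, preserving the 4x scale factor. Folder names follow this thread.
    for img_path in os.listdir("data/lq"):
        lq = cv2.imread("data/lq/" + img_path)
        cv2.imwrite("data/lq/" + img_path, cv2.resize(lq, (128, 128)))

    for img_path in os.listdir("data/gt"):
        gt = cv2.imread("data/gt/" + img_path)
        cv2.imwrite("data/hq/" + img_path, cv2.resize(gt, (512, 512)))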

Twenkid commented 2 years ago

@AIMads I'll check it out when I can, but that sounds as if the neural model has a fixed input size (architecture) of 96x96, so I guess your desired solution wouldn't fit that simply; it may require refactoring the NN architecture.

BTW, one alternative to this library is to use the basic wav2lip and then Deepfacelab with a self-generating model. I used this method for my deepfakes with my custom DFL modification for grayscale training, and that way I repair and upscale the bad and broken mouths from wav2lip to smooth 192x192 faces. My videos "Lena Schwarzenegger announces Arnold's return in Red Heat 2 ..." https://youtu.be/4F7PB7wBEXk and this more advanced one, "Arnold reacts to ..." https://youtu.be/X56QkNzkkVM, are produced with this method. See my comment on the latter, which explains the technique. It can be applied for higher upscaling too, with a respective DFL model if you have a good GPU. For 192x192 (and grayscale in my use case) it looks good.

xjw00654 commented 2 years ago

The original ESRGAN is trained with 128x128 or 256x256 GT patches, so the discriminator expects its input to be that size. To train on 384x384 crops, just modify the discriminator as follows:

    # in /wav2lip-hq/basicsr/archs/discriminator_arch.py
    # .......
        # 4 * 4 -> 12 * 12, where 12 = 384 // 2**5 (the five stride-2 convs downsample by 32)
        self.linear1 = nn.Linear(num_feat * 8 * 12 * 12, 100)
        self.linear2 = nn.Linear(100, 1)

    def forward(self, x):
        # remove (or keep commented out) the hard-coded 128x128 size check:
        # assert x.size(2) == 128 and x.size(3) == 128, (f'Input spatial size must be 128x128, '
        #                                                f'but received {x.size()}.')

It will take more GPU memory, since we increase the input size; in this case it consumes around 12 GB with batch size = 4 and input size 384x384.
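
As a quick sanity check of the 12x12 figure above (standalone arithmetic, not repo code): the stock discriminator reduces a 128x128 input to 4x4, i.e. by a factor of 32 = 2**5, so:

    # 384x384 input -> 12x12 feature map in front of linear1;
    # the stock 128x128 input gives 128 // 32 = 4, matching the original 4 * 4 layer.
    input_size = 384
    feat_size = input_size // 2 ** 5
    print(feat_size)  # 12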

rookiexyz commented 1 year ago

Sorry for the dumb question, but what am I doing wrong? I did all the steps carefully and at the end I get this error:

Traceback (most recent call last):
  File "basicsr/train.py", line 221, in <module>
    train_pipeline(root_path)
  File "basicsr/train.py", line 132, in train_pipeline
    result = create_train_val_dataloader(opt, logger)
  File "basicsr/train.py", line 74, in create_train_val_dataloader
    train_set = build_dataset(dataset_opt)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/data/__init__.py", line 34, in build_dataset
    dataset = DATASET_REGISTRY.get(dataset_opt['type'])(dataset_opt)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/data/paired_image_dataset.py", line 65, in __init__
    self.paths = paired_paths_from_folder([self.lq_folder, self.gt_folder], ['lq', 'gt'], self.filename_tmpl)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/data/data_util.py", line 213, in paired_paths_from_folder
    assert len(input_paths) == len(gt_paths), (f'{input_key} and {gt_key} datasets have different number of images: '
AssertionError: lq and gt datasets have different number of images: 3750, 4243.
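
The assertion says the lq and gt folders ended up with different numbers of frames, so the pairing step fails. One possible cleanup is to drop the unpaired frames before training; a sketch only, using the data/lq and data/gt folder names from earlier in this thread (adjust to whatever your train_basicsr.yml actually points at):

    import os

    # Drop frames that exist in only one of the two folders so that
    # paired_paths_from_folder sees matching lq/gt file lists.
    lq_files = set(os.listdir("data/lq"))
    gt_files = set(os.listdir("data/gt"))

    for name in lq_files - gt_files:
        os.remove(os.path.join("data/lq", name))
    for name in gt_files - lq_files:
        os.remove(os.path.join("data/gt", name))

If data/hq is generated from data/gt (as in the resize snippet above), regenerate it after pruning so all the folders stay in sync.
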
davidchateau commented 1 year ago

Hello, I fixed the training notebook. I was able to train, and also to resume training (colab notebooks don't like background processes, even if you buy compute units, so you have to train, download the models, and resume if you get disconnected).

I trained a model up to iteration 100,000, but the results are not good when running inference. I'm not talking about video quality, but about the quality of the lip syncing. Maybe I'm doing something wrong? I'll investigate further.

Here is the colab notebook; I suggest you duplicate it in your own google drive first: https://colab.research.google.com/drive/1fWCy4Vri2FKrVV7q50ybL_ftzE4bz0Od?usp=sharing

Please share your results (video before/after lipsync, along with the audio file, and maybe the file used for training?)

Regards