Rudrabha / Wav2Lip

This repository contains the codes of "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For HD commercial model, please try out Sync Labs
https://synclabs.so

[Help needed!] There's clear box region around the mouth when using personal video #415

Open liuquande opened 2 years ago

liuquande commented 2 years ago

Dear author,

Thanks for sharing the excellent work.

I found that when using my personal video, there is a clearly visible box around the mouth in the output, as shown below: [image]

What could be the reason for this, and could you please give me some guidance on how to solve it?

Many thanks for the help.

Best.

alexanderj1988 commented 2 years ago

have the same problem here

NikitaKononov commented 2 years ago

The problem is that the model works at 96x96 resolution. It downscales the face square and then upscales the result to fit your source video. There's no solution; you can only train a hi-res model )
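To make the scale mismatch concrete, here is a minimal sketch of the arithmetic. The 96-pixel side comes from the comment above; the face-box sizes and the helper name `upscale_factor` are hypothetical and only serve as illustration.

```python
MODEL_FACE_SIZE = 96  # side of the square face crop, as stated in the comment above


def upscale_factor(face_box_side: int) -> float:
    """How much the 96x96 output must be stretched to fill a face box of this side."""
    return face_box_side / MODEL_FACE_SIZE


# Hypothetical face-box sides in pixels, chosen only for illustration.
for side in (96, 192, 480):
    print(f"{side}px face box -> {upscale_factor(side):.1f}x upscale")
```

The larger the factor, the softer the pasted region looks next to the untouched, full-resolution pixels around it.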

Curisan commented 1 year ago

@liuquande I trained the model using AVSpeech and hit the same problem. Did you use the pretrained model?

YinonDouchanClarity commented 1 year ago

The reason this is happening is because the detected face image is resized to 96x96 before being inputted to the network, and the outputted lip-synced face, which is also 96x96, is being resized back to the original dimensions of the face. If the original face is larger than 96x96, it might cause those blurry edges due to up-sampling the lip-sync (probably by interpolation). The bigger the difference in dimensions between the original face and the 96x96 input dimensions, the more conspicuous those edges will be. I see two straightforward ways to deal with it:

  1. Downsample input frames using the --resize_factor argument. While this will reduce video resolution, it will reduce face dimensions and mitigate the square effect.
  2. (requires modifying code) Calculate a face mask for each frame so you'll know to paste only the outputted lip-synced face and not its surrounding. There are many libraries that can do that in different ways such as MediaPipe and face-parsing.

I implemented (2) and it completely eliminated the square while keeping the lip-synced face intact.
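The difference between pasting the whole box and pasting through a mask (option 2) can be sketched with NumPy alone. This is an illustration under assumptions, not the code from the repository: the helper names `paste_box` and `paste_masked`, the tiny arrays, and the hand-made square mask are all hypothetical, and the box order (y1, y2, x1, x2) is assumed to match the coordinates used in inference.py.

```python
import numpy as np


def paste_box(frame, patch, box):
    """Direct paste: overwrite the whole face box (this is what leaves a visible square)."""
    y1, y2, x1, x2 = box
    out = frame.copy()
    out[y1:y2, x1:x2] = patch
    return out


def paste_masked(frame, patch, box, mask):
    """Masked paste: replace only pixels where mask == 1 and keep the original elsewhere."""
    y1, y2, x1, x2 = box
    out = frame.copy()
    region = out[y1:y2, x1:x2]
    out[y1:y2, x1:x2] = np.where(mask[..., None] == 1, patch, region)
    return out


# Toy data: a bright 8x8 frame and a darker 4x4 "lip-synced" patch.
frame = np.full((8, 8, 3), 200, dtype=np.uint8)
patch = np.full((4, 4, 3), 50, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1          # hypothetical face region inside the box
box = (2, 6, 2, 6)          # y1, y2, x1, x2

boxed = paste_box(frame, patch, box)
masked = paste_masked(frame, patch, box, mask)

changed_boxed = int((boxed != frame).any(axis=-1).sum())
changed_masked = int((masked != frame).any(axis=-1).sum())
print(changed_boxed, changed_masked)
```

With the direct paste every pixel in the box changes, while the masked paste touches only the pixels the mask marks as face, so the low-resolution border of the box never reaches the output frame.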

stevin-dong commented 1 year ago

Could you please share the modified code? I'm a beginner, and this issue has been bothering me for a long time. Thank you very much!

YinonDouchanClarity commented 1 year ago

It's a bit problematic for me to send the code. However, I can send the modifications. Take them as rough guidelines rather than exact. Read about how to use MediaPipe.

Install MediaPipe:

pip install mediapipe==0.10.0

Download face landmarker model and put it in the "weights" folder:

wget -O weights/face_landmarker_v2_with_blendshapes.task -q https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task

Add MediaPipe imports:

from mediapipe.tasks import python
from mediapipe.tasks.python import vision
import mediapipe as mp

Add face mask args:

    parser.add_argument('--face_landmarks_detector_path', default='weights/face_landmarker_v2_with_blendshapes.task',
                        type=str, help='Path to face landmarks detector')
    parser.add_argument('--with_face_mask', action='store_true',
                        help='Blend output into original frame using a face mask rather than directly blending the face box. This prevents a lower resolution square artifact around lower face')
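As a quick check of how these two options behave, here is a standalone sketch that uses a fresh parser rather than the one in inference.py. The help strings are shortened, and the variable names `default_args` and `masked_args` are only for illustration.

```python
import argparse

# Standalone parser for illustration; in the real script these lines go next to the existing arguments.
parser = argparse.ArgumentParser(description="Sketch of the two face mask options")
parser.add_argument('--face_landmarks_detector_path',
                    default='weights/face_landmarker_v2_with_blendshapes.task',
                    type=str, help='Path to face landmarks detector')
parser.add_argument('--with_face_mask', action='store_true',
                    help='Blend output into original frame using a face mask')

default_args = parser.parse_args([])                    # flag omitted
masked_args = parser.parse_args(['--with_face_mask'])   # flag given

print(default_args.with_face_mask)   # False: behaves like the unmodified script
print(masked_args.with_face_mask)    # True: the masked blending path is taken
print(masked_args.face_landmarks_detector_path)
```

Because the flag uses store_true, the masked path is opt-in and the script keeps its original behavior unless --with_face_mask is passed.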

In main(), in the loop starting with "for p, f, c in ...", replace the line f[y1:y2, x1:x2] = p with:

                mask = face_mask_from_image(p, face_landmarks_detector)
                f[y1:y2, x1:x2] = f[y1:y2, x1:x2] * (1 - mask[..., None]) + p * mask[..., None]
            else:
                f[y1:y2, x1:x2] = p

Add face_mask_from_image function:

def face_mask_from_image(image, face_landmarks_detector):
    """
    Calculate face mask from image. This is done by detecting face landmarks
    with MediaPipe and filling their convex hull.

    Args:
        image: numpy array of an image
        face_landmarks_detector: MediaPipe face landmarks detector
    Returns:
        A uint8 numpy array with the same height and width as the input image, containing a binary mask of the face in the image
    """
    # initialize mask
    mask = np.zeros((image.shape[0], image.shape[1]), dtype=np.uint8)

    # detect face landmarks
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)
    detection = face_landmarks_detector.detect(mp_image)

    if len(detection.face_landmarks) == 0:
        # no face detected - set mask to all of the image
        mask[:] = 1
        return mask

    # extract landmarks coordinates (landmarks are normalized, so scale by width and height)
    face_coords = np.array([[lm.x * image.shape[1], lm.y * image.shape[0]] for lm in detection.face_landmarks[0]])

    # calculate convex hull from face coordinates
    convex_hull = cv2.convexHull(face_coords.astype(np.float32))

    # apply convex hull to mask
    return cv2.fillPoly(mask, pts=[convex_hull.squeeze().astype(np.int32)], color=1)

stevin-dong commented 1 year ago

Thank you very much for your help!

liumaokun2022 commented 1 year ago

I tried this code in inference.py, but it fails with an error: face_landmarks_detector is not defined. Can you show how to create this object with MediaPipe?

sailorsale commented 1 year ago

I tried this code in inference.py, but it fails with an error: face_landmarks_detector is not defined. Can you show how to create this object with MediaPipe?

dizhenx commented 1 year ago

                mask = face_mask_from_image(p, face_landmarks_detector)
                f[y1:y2, x1:x2] = f[y1:y2, x1:x2] * (1 - mask[..., None]) + p * mask[..., None]
            else:
                f[y1:y2, x1:x2] = p

This code is incomplete: there is an else, so why is there no if before it?

AIFSH commented 11 months ago

Thanks, it works!

Crestina2001 commented 11 months ago

Hi. For those wondering how to create face_landmarks_detector, here is the code:

    BaseOptions = mp.tasks.BaseOptions
    FaceLandmarker = mp.tasks.vision.FaceLandmarker
    FaceLandmarkerOptions = mp.tasks.vision.FaceLandmarkerOptions
    VisionRunningMode = mp.tasks.vision.RunningMode

    options = FaceLandmarkerOptions(
        base_options=BaseOptions(model_asset_path=args.face_landmarks_detector_path),
        running_mode=VisionRunningMode.IMAGE)

    face_landmarks_detector = FaceLandmarker.create_from_options(options)

But this won't necessarily solve the problem, because the edge of the pasted face may still not match the surrounding frame. The only solution may be to train on high-res images.
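One common way to soften a hard seam at the mask edge is to feather the binary mask before blending. This is a general compositing trick swapped in here, not something the thread's code does, and it cannot add detail that the 96x96 output lacks. The sketch below uses NumPy only; the helper names `feather_mask` and `blend`, the box-blur radius, and the toy arrays are all hypothetical.

```python
import numpy as np


def feather_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Soften a binary 0/1 mask with a separable box blur; returns float32 values in [0, 1]."""
    soft = mask.astype(np.float32)
    if radius <= 0:
        return soft
    size = 2 * radius + 1
    kernel = np.ones(size, dtype=np.float32) / size
    padded = np.pad(soft, radius, mode="edge")
    # Blur along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    both = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return np.clip(both, 0.0, 1.0).astype(np.float32)


def blend(frame_region: np.ndarray, patch: np.ndarray, soft_mask: np.ndarray) -> np.ndarray:
    """Alpha-blend patch over frame_region in float, then convert back to uint8."""
    alpha = soft_mask[..., None]
    mixed = frame_region.astype(np.float32) * (1.0 - alpha) + patch.astype(np.float32) * alpha
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


# Toy data: 9x9 region with a 3x3 "face" in the middle.
mask = np.zeros((9, 9), dtype=np.uint8)
mask[3:6, 3:6] = 1
frame_region = np.full((9, 9, 3), 200, dtype=np.uint8)
patch = np.full((9, 9, 3), 50, dtype=np.uint8)

soft = feather_mask(mask, radius=1)
blended = blend(frame_region, patch, soft)
print(soft.shape, blended[0, 0, 0], blended[3, 3, 0], blended[4, 4, 0])
```

In this toy case the center pixel takes the patch value, the far corner keeps the frame value, and pixels along the mask boundary land in between, which is what hides the hard edge.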

Crestina2001 commented 11 months ago

Using --resize_factor does not seem to help when my videos are already 720p.
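One possible reason is simple arithmetic. The sketch below assumes that --resize_factor divides both frame dimensions by an integer before processing, which is my reading of the option rather than something stated in this thread. The 360-pixel face size and the helper name `min_resize_factor` are hypothetical.

```python
import math

MODEL_FACE_SIZE = 96  # model input side, as stated earlier in the thread


def min_resize_factor(face_side: int) -> int:
    """Smallest integer factor that shrinks the face box to at most 96 pixels."""
    return max(1, math.ceil(face_side / MODEL_FACE_SIZE))


frame_height = 720   # 720p source, as in the comment above
face_side = 360      # hypothetical: the face fills half of the frame height

factor = min_resize_factor(face_side)
resized_height = frame_height // factor
resized_face = face_side // factor
print(factor, resized_height, resized_face)
```

Under these assumptions the face only drops below 96 pixels at a factor of 4, which turns a 720p video into a 180p one. Smaller factors reduce the upscaling but do not remove it, so the box can remain visible.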

EricKong1985 commented 11 months ago

hit the same problem

vari-sh commented 1 month ago

Don't know if this is still useful, but I managed to make it work:

        BaseOptions = mp.tasks.BaseOptions
        FaceLandmarker = mp.tasks.vision.FaceLandmarker
        FaceLandmarkerOptions = mp.tasks.vision.FaceLandmarkerOptions
        VisionRunningMode = mp.tasks.vision.RunningMode

        options = FaceLandmarkerOptions(
            base_options=BaseOptions(model_asset_path=args.face_landmarks_detector_path),
            running_mode=VisionRunningMode.IMAGE)

        face_landmarks_detector = FaceLandmarker.create_from_options(options)

        for p, f, c in zip(pred, frames, coords):
            y1, y2, x1, x2 = c
            p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))

            if args.with_face_mask:
                mask = face_mask_from_image(p, face_landmarks_detector)
                f[y1:y2, x1:x2] = f[y1:y2, x1:x2] * (1 - mask[..., None]) + p * mask[..., None]
            else:
                f[y1:y2, x1:x2] = p
            out.write(f)

For me it was really important to remove tqdm, including its import (on Windows); otherwise I got WinError 6 (handle not valid). I think it has something to do with multithreading, but I did not look into it.

Hope it helps ✌️