GalaxyCong / HPMDubbing

[CVPR 2023] Official code for the paper: Learning to Dub Movies via Hierarchical Prosody Models.
MIT License

I am finding it difficult to perform data preprocessing on my custom video. #7

Open ds-jpg opened 8 months ago

ds-jpg commented 8 months ago

Dear Author,

Thank you for your amazing work.

I am really interested in running inference with your model on my custom dataset. Currently, I do not understand the role of MFA or how to run inference on a custom dataset. Could you please guide me on how to run inference on a custom video?

Thanks in advance.

GalaxyCong commented 8 months ago

Hi ds-jpg,

Thanks for your valuable feedback.

If you want to run inference on a custom dataset, this work needs the following data: the d-vector of the reference audio, the phonemes, and the visual features (a lip feature sequence, a face feature sequence, and a scene feature sequence).

In our next work, we will release a complete end-to-end model, eliminating the need for these multiple pre-processing steps. Besides, the scale of V2C and Chem (a single-speaker dataset) is currently relatively small, so I am afraid the dubbing quality may not generalize well to custom clips. We plan to train on in-the-wild datasets such as LRS2, LRS3, and LRW, and to build larger dubbing datasets, especially for real film scenes. Please stay tuned.

Best,

issuer2002 commented 8 months ago

If you want to run inference on a custom dataset, you need the following data: the d-vector of the reference audio, the phonemes, and the visual features (lip, face, and scene feature sequences). Here is how to obtain each of them:

  • Reference audio d-vector: this extraction script is provided by V2C-Net and is based on GE2E (see the first sketch after this list).
  • Phonemes: our model takes phonemes as input instead of raw text. Each phoneme and its duration are stored in a *.TextGrid file, so the phonemes are loaded directly from the *.TextGrid files produced by MFA. The *.TextGrid files are also used during data preprocessing to obtain the mel-spectrograms, pitch, energy, and stats.json (see the second sketch after this list).
  • Visual features (lip feature sequence, face feature sequence): first, crop the lip and face regions from the original frames (see the repository for more details); then use EmoFAN and the LipEncoder to obtain the facial and lip-motion feature sequences, respectively (see the third sketch after this list).
  • Visual features (scene feature sequence): this extraction script is provided by V2C-Net and is based on the I3D model (see the last sketch after this list).
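
For the d-vector: the actual script comes from V2C-Net's GE2E code. Purely as an illustration of what the output looks like, here is a minimal sketch using the third-party Resemblyzer package, which also implements a GE2E speaker encoder; the file names are placeholders.

```python
# Sketch: extract a GE2E-style speaker d-vector from a reference audio clip.
# Resemblyzer (pip install resemblyzer) stands in for V2C-Net's GE2E script;
# file names are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("reference_audio.wav")   # load, resample to 16 kHz, trim silence
encoder = VoiceEncoder()                      # pretrained GE2E speaker encoder
d_vector = encoder.embed_utterance(wav)       # L2-normalized 256-dim embedding
np.save("reference_audio-spk_emb.npy", d_vector)
print(d_vector.shape)                         # (256,)
```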
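
For the phonemes: a minimal sketch of reading the "phones" tier of an MFA *.TextGrid in the style of FastSpeech2-like preprocessors. The sampling rate and hop size below are assumptions; use the values from the project's preprocessing config.

```python
# Sketch: load phonemes and frame-level durations from an MFA *.TextGrid
# (pip install tgt). Sampling rate / hop size are assumed values; take the
# real ones from the project's preprocessing config.
import tgt

SAMPLING_RATE = 22050   # assumption
HOP_LENGTH = 256        # assumption

textgrid = tgt.io.read_textgrid("example.TextGrid")
tier = textgrid.get_tier_by_name("phones")

phones, durations = [], []
for interval in tier._objects:   # each interval has start_time, end_time, text
    s, e, p = interval.start_time, interval.end_time, interval.text
    phones.append(p)
    # duration in mel frames, rounded consistently for every phoneme
    durations.append(
        int(round(e * SAMPLING_RATE / HOP_LENGTH))
        - int(round(s * SAMPLING_RATE / HOP_LENGTH))
    )

print(phones[:10], durations[:10])
```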
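
For the lip and face crops: the real pipeline crops with facial landmarks before feeding EmoFAN and the LipEncoder. As a rough stand-in, here is a sketch that detects a face box per frame with OpenCV's Haar cascade and takes the lower half of the box as a crude mouth region; the paths and crop sizes are assumptions.

```python
# Sketch: crop per-frame face and (crude) mouth regions from a video as inputs
# for EmoFAN / LipEncoder. A Haar cascade stands in for the landmark-based
# cropping used in the real pipeline; paths and sizes are assumed.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

cap = cv2.VideoCapture("custom_video.mp4")
face_crops, mouth_crops = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        continue
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])       # largest detected face
    face = cv2.resize(frame[y:y + h, x:x + w], (224, 224))   # face crop for EmoFAN
    mouth = cv2.resize(frame[y + h // 2:y + h, x:x + w], (96, 96))  # lower half as mouth
    face_crops.append(face)
    mouth_crops.append(mouth)
cap.release()
print(len(face_crops), "frames cropped")
```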
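
For the scene features: V2C-Net's script uses I3D. As a rough stand-in that only shows the expected shape of a clip-level feature, here is a sketch with torchvision's r3d_18 video backbone (not I3D) applied to a dummy clip; in practice the clip tensor would be built from the normalized video frames.

```python
# Sketch: clip-level scene feature extraction. torchvision's r3d_18 stands in
# for the I3D model used by V2C-Net's script; the random clip tensor is a
# placeholder for real (normalized) video frames.
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")   # pretrained weights (torchvision >= 0.13)
model.fc = torch.nn.Identity()      # drop the classifier; keep the 512-dim feature
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
with torch.no_grad():
    scene_feature = model(clip)
print(scene_feature.shape)               # torch.Size([1, 512])
```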

May I ask roughly when the end-to-end model will be released?

GalaxyCong commented 7 months ago

Probably in the second half of this year. Thank you very much for your interest! 👍

isxuwl commented 4 months ago

I'm very much looking forward to your next step and hope to see it soon!