choijeongsoo / lip2speech-unit

[Interspeech 2023] Intelligible Lip-to-Speech Synthesis with Speech Units

Installations and code running #5

Open Buicongbang04 opened 1 month ago

Buicongbang04 commented 1 month ago

Hi sir, can you describe the steps to run this repo more specifically? I followed your instructions but it did not work. Do I need to fix the paths in the config and .sh files? And do I need to download the data and checkpoints?

Hope you answer this soon.

choijeongsoo commented 1 month ago

Hello, thank you for your interest in our work.

We have updated our code and added more details. After downloading the required checkpoints into the 'checkpoints' directory, you can use the following commands for inference:

```
cd multi_target_lip2speech && bash scripts/lrs3/inference.sh && cd ..
cd multi_input_vocoder && bash scripts/lrs3/inference.sh && cd ..
```
or
```
cd multi_target_lip2speech && bash scripts/lrs3/inference_avhubert.sh && cd ..
cd multi_input_vocoder && bash scripts/lrs3/inference_aug.sh && cd ..
```

The results can be found in the 'results/lrs3' directory. If you encounter any issues, please let us know where the error occurred.

Buicongbang04 commented 1 month ago

@choijeongsoo Thanks for your great work here. Sorry, but I have a question: can you show me how to preprocess my custom data to run with your model?

choijeongsoo commented 2 weeks ago

I'm sorry for the late reply.

For inference, you need a lip region video and a speaker embedding from a sample speech.

  1. Lip video: we followed the preparation steps in https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation. You can also check https://github.com/facebookresearch/muavic for more information.
  2. Speaker embedding: we followed https://github.com/CorentinJ/Real-Time-Voice-Cloning. You can also check model_speaker_encoder.py and encoder.pt in https://github.com/choijeongsoo/av2av/tree/main/unit2av. A minimal extraction sketch is shown below.
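
Here is a minimal sketch of step 2, assuming the encoder API of the Real-Time-Voice-Cloning repo (its `encoder/inference.py` module) and its pretrained `encoder.pt`; the file paths are placeholders:

```python
# Hedged sketch: extract a 256-dim speaker embedding with the
# CorentinJ/Real-Time-Voice-Cloning encoder. Assumes that repo is on
# PYTHONPATH and that checkpoints/encoder.pt is its pretrained checkpoint.
from pathlib import Path
import numpy as np
from encoder import inference as encoder

encoder.load_model(Path("checkpoints/encoder.pt"))
wav = encoder.preprocess_wav(Path("sample_speech.wav"))  # resample and trim silence
embedding = encoder.embed_utterance(wav)                 # np.ndarray of shape (256,)
np.save("speaker_embedding.npy", embedding[None, :])     # saved as a [1 x 256] array
```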

We plan to provide a complete pipeline for generating output from arbitrary videos and sample speech, but I think it will take a while.

Buicongbang04 commented 2 weeks ago

[screenshot: extracted feature tensor] Do you mean the speaker embedding looks like this? I followed the avhubert repo, and it generated a lip video and extracted features from that video as a tensor, as shown in the screenshot. Is this the content of the .unt file, and what is the meaning of the dict.unt.txt file? Hope to receive your response soon. Thanks!

choijeongsoo commented 2 weeks ago

The speaker embedding is a [1 x 256] vector for one utterance. If I remember correctly, after the frame-level embeddings are pooled into a single vector, it is passed through a ReLU activation and then L2-normalized.
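
For illustration, here is a minimal PyTorch sketch of that post-processing (an assumption based on the description above, not the exact code from the repo):

```python
import torch
import torch.nn.functional as F

def pool_speaker_embedding(frame_embeddings: torch.Tensor) -> torch.Tensor:
    # Mean-pool frame-level embeddings [T, 256] to a single vector,
    # apply ReLU, then L2-normalize (hedged sketch of the steps described above).
    pooled = F.relu(frame_embeddings.mean(dim=0))
    pooled = pooled / pooled.norm(p=2).clamp(min=1e-8)
    return pooled.unsqueeze(0)  # shape [1, 256]

# Example with dummy frame-level embeddings for one utterance
emb = pool_speaker_embedding(torch.randn(120, 256))
print(emb.shape, emb.norm().item())  # torch.Size([1, 256]), norm ~1.0
```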

The .unt file for speech units is analogous to the .wrd file for subwords in the avhubert repo. We used a dictionary size of 200, so the dict.unt.txt file contains 200 lines, one per speech unit (0, 1, ..., 199).
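
For example, assuming the dictionary follows the standard fairseq format (one "symbol count" pair per line, which is my assumption about the exact layout), dict.unt.txt could be generated like this, with each line of a .unt file then being a space-separated sequence of these unit indices for one utterance:

```python
# Hedged sketch: write a fairseq-style unit dictionary with 200 symbols.
# Assumes each line is "<unit> <count>", as in standard fairseq dict files.
with open("dict.unt.txt", "w") as f:
    for unit in range(200):
        f.write(f"{unit} 1\n")
```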