FuxiVirtualHuman / styletalk


How to extract the phonemes? #11

Open · thomas-endres-tng opened this issue 1 year ago

thomas-endres-tng commented 1 year ago

Unfortunately, your paper gives no reference for the phoneme extraction other than a link to CMU Sphinx.

I did a bit of research and ended up with the following code:

import json
import wave

import pocketsphinx as ps
from pocketsphinx import Decoder

ASSUMED_FRAME_RATE = 30  # video frame rate; see the note below


def create_phoneme(audio_wave_file):
    # Run PocketSphinx in allphone mode so it emits phonemes instead of words.
    with wave.open(audio_wave_file, "rb") as audio:
        decoder = Decoder(samprate=audio.getframerate(), allphone=ps.get_model_path("en-us/en-us-phone.lm.bin"))
        decoder.start_utt()
        decoder.process_raw(audio.getfp().read(), full_utt=True)
        decoder.end_utt()

    input_phoneme_list = []
    if decoder.hyp():
        # Decoder frames are centiseconds: 100 per second of audio.
        for seg in decoder.seg():
            input_phoneme_list.append({'phone': seg.word, 'phone_end_frame': seg.end_frame})
    else:
        raise Exception('Phoneme recognition failed')

    # Total clip length converted from decoder frames to video frames.
    total_number_of_frames_in_audio = int(input_phoneme_list[-1]['phone_end_frame'] / 100 * ASSUMED_FRAME_RATE)
    print(total_number_of_frames_in_audio)

    # Assign one phoneme per video frame: repeat a phoneme for every video
    # frame that ends before the phoneme's decoder end frame.
    frame_index = 0
    phone_list = []
    phone_index = 0

    while frame_index < total_number_of_frames_in_audio:
        if (frame_index * 100 / ASSUMED_FRAME_RATE) < input_phoneme_list[phone_index]['phone_end_frame']:
            phone_list.append(input_phoneme_list[phone_index]['phone'])
            frame_index += 1
        else:
            phone_index += 1

    # Map phoneme symbols to the integer ids the model expects.
    with open("phindex.json") as f:
        ph2index = json.load(f)
    phonemes = []
    for p in phone_list:
        if p in ph2index:
            phonemes.append(ph2index[p])
        else:
            print(f"Weird phoneme found: {p}. Ignoring...")
            phonemes.append(31)  # id 31 = silence

    print("Phoneme generation done")

    return phonemes
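
Called like this (the path is only an example), it returns one integer phoneme id per video frame:

phoneme_ids = create_phoneme("samples/audio.wav")  # hypothetical path
print(len(phoneme_ids), phoneme_ids[:10])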

I'm using the phindex.json file from https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face/blob/main/phindex.json and an ASSUMED_FRAME_RATE of 30 (this seems to match the number of phonemes in your samples better than the 25 fps referenced in the papers).
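
As a sanity check for the frame rate (a rough sketch; the wave file name is a placeholder), the expected number of per-frame phonemes for a given fps can be derived from the audio duration and compared against the length of your sample phoneme lists:

import wave

def expected_frame_count(audio_wave_file, video_fps):
    # Number of video frames the audio should cover at the given fps.
    with wave.open(audio_wave_file, "rb") as audio:
        duration_s = audio.getnframes() / audio.getframerate()
    return int(duration_s * video_fps)

# len(phone_list) should match this at the correct fps:
# expected_frame_count("sample.wav", 30) vs expected_frame_count("sample.wav", 25)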

However, the phonemes I get for the sample wave files look very different from those in your samples. What am I doing wrong?

BeauGeogeo commented 1 year ago

Hello!

Same problem for me. I also assumed 30 fps and slightly modified the function from AVCT to parse the phonemes.

The phoneme parsing itself works, but as @thomas-endres-tng mentioned, the phonemes come out very different. I use pretty much the same code. I also tried a slightly different approach, segmenting the audio first and then processing each segment with full_utt=False, but the results were still not as good as AVCT's.
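
The chunked variant I tried looks roughly like this (a sketch; the chunk size is arbitrary):

import wave
from pocketsphinx import Decoder, get_model_path

def decode_in_chunks(audio_wave_file, chunk_frames=2048):
    # Feed the decoder piece by piece with full_utt=False instead of one big buffer.
    with wave.open(audio_wave_file, "rb") as audio:
        decoder = Decoder(samprate=audio.getframerate(),
                          allphone=get_model_path("en-us/en-us-phone.lm.bin"))
        decoder.start_utt()
        while True:
            buf = audio.readframes(chunk_frames)
            if not buf:
                break
            decoder.process_raw(buf, full_utt=False)
        decoder.end_utt()
    # Return (phoneme, start, end) triples in decoder frames (centiseconds).
    return [(seg.word, seg.start_frame, seg.end_frame) for seg in decoder.seg()]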

I tried different versions of the language model, including the most recent one of course, and I'm unable to reproduce the same words or the same phonemes. For word recognition on AVCT's AdamSchiff.wav, my result still misses some words, and "Putin" is always recognized as "proven", whereas it appears correctly in the AVCT json file.
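
For the word-level tests, I configure the decoder with an explicit acoustic model, language model, and dictionary (a sketch; the exact paths depend on the model layout your pocketsphinx version ships):

from pocketsphinx import Decoder, get_model_path

decoder = Decoder(
    hmm=get_model_path("en-us/en-us"),                # acoustic model directory
    lm=get_model_path("en-us/en-us.lm.bin"),          # word-level language model
    dict=get_model_path("en-us/cmudict-en-us.dict"),  # pronunciation dictionary
    samprate=16000,
)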

Since I plan to use StyleTalk for my internship project, any help would be much appreciated!

Thank you very much :)