MahmoudAshraf97 / ctc-forced-aligner

Text to speech alignment using CTC forced alignment

Error when downloading and using different model "jonatasgrosman/wav2vec2-large-xlsr-53-japanese" #32

Closed: andriken closed this issue 2 weeks ago

andriken commented 3 weeks ago

Below is my Python usage:

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "denoised_vocals.wav"
text_path = "text.txt"
language = "jpn" # ISO-639-3 Language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

alignment_model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-japanese"  # Change this to your desired model

alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    model_path=alignment_model_name,  # Pass the model name here
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

with open(text_path, "r", encoding="utf-8") as f:
    lines = f.readlines()
text = "".join(line for line in lines).replace("\n", " ").strip()

emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
    # split_size="sentence",
)

segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_token)

word_timestamps = postprocess_results(text_starred, spans, stride, scores)

print("code ran....")
print(word_timestamps)

I got the error below. I can confirm that the first run downloaded the model successfully and then did nothing else. When I ran the code again, I got this error:

(D:\ctc\cfenv) PS D:\ctc> python run.py
D:\ctc\cfenv\lib\site-packages\transformers\configuration_utils.py:306: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
D:\ctc\cfenv\lib\site-packages\transformers\configuration_utils.py:306: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
  warnings.warn(
Traceback (most recent call last):
  File "run.py", line 53, in <module>
    spans = get_spans(tokens_starred, segments, blank_token)
  File "D:\ctc\cfenv\lib\site-packages\ctc_forced_aligner\alignment_utils.py", line 63, in get_spans
    assert seg.label == ltr, f"{seg.label} != {ltr}"
AssertionError: <star> != s

MahmoudAshraf97 commented 3 weeks ago

If all the characters in your text are included in the model's vocabulary, then you need to use romanize=False. Romanization converts text in any language into Latin letters, which probably do not exist in a Japanese model's vocabulary.
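One way to check this before aligning is to compare the characters in the text against the tokenizer's vocabulary. The snippet below is a minimal sketch, not part of ctc-forced-aligner: `missing_characters` is a hypothetical helper, and the toy vocabulary stands in for what `alignment_tokenizer.get_vocab()` would return, assuming the loaded tokenizer is a standard Hugging Face tokenizer.

```python
def missing_characters(text, vocab):
    """Return the sorted characters of `text` that are not tokens in `vocab`.

    Hypothetical helper for illustration only. Whitespace is skipped because
    CTC vocabularies usually mark word boundaries with a separate delimiter
    token (often "|") rather than a literal space.
    """
    return sorted({ch for ch in text if not ch.isspace() and ch not in vocab})


# Toy vocabulary standing in for alignment_tokenizer.get_vocab() (an assumption).
toy_vocab = {"<pad>": 0, "|": 1, "そ": 2, "れ": 3, "か": 4, "ら": 5}

# Native script: every character is in the vocabulary, so romanize=False is safe.
print(missing_characters("それ から", toy_vocab))   # []

# Romanized text: none of the Latin letters exist in this vocabulary.
print(missing_characters("sore kara", toy_vocab))   # ['a', 'e', 'k', 'o', 'r', 's']
```

If the list is empty for the unromanized text, romanize=False should be the right setting for that model. A non-empty list for the romanized text is consistent with the `<star> != s` assertion above, where the letter `s` could not be matched.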

andriken commented 3 weeks ago

Yes, it works now. But why do I always get output like this?

[{'start': 1.22, 'end': 1.22, 'text': 'そ', 'score': 0.0}, {'start': 1.32, 'end': 1.32, 'text': 'れ', 'score': 0.0}, {'start': 1.34, 'end': 1.36, 'text': ' ', 'score': 0.0}, {'start': 1.4, 'end': 1.4, 'text': 'か', 'score': 0.0}, {'start': 1.52, 'end': 1.52, 'text': 'ら', 'score': 0.0}, {'start': 1.54, 'end': 1.56, 'text': ' ', 'score': 0.0}, {'start': 1.82, 'end': 1.82, 'text': '母', 'score': 0.0}, {'start': 1.86, 'end': 1.86, 'text': 'さ', 'score': 0.0}, {'start': 1.88, 'end': 1.9, 'text': ' ', 'score': 0.0}

Shouldn't it be a "text" field followed by the segments, as in the JSON output shown in the README file, like this?

{ "text": "This is a sample text to be aligned with the audio.", "segments": [ { "start": 0.000, "end": 1.234, "text": "This" },

MahmoudAshraf97 commented 2 weeks ago

It does look like that when the results are written to a JSON file. If you are using the library from Python, the output you are getting is the correct one.
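If the README-style layout is wanted from Python, the list can be wrapped by hand. This is a minimal sketch under the assumption that the JSON file is just the original text plus the list of segments, as the README excerpt above suggests. `to_json_document` is a hypothetical helper, not a function of the library, and the sample list reuses two entries from the output quoted earlier.

```python
import json


def to_json_document(text, word_timestamps):
    """Wrap aligned segments in a {"text": ..., "segments": [...]} document.

    Hypothetical helper: the layout mirrors the README excerpt and is an
    assumption about the file format, not the library's own writer.
    """
    document = {"text": text, "segments": word_timestamps}
    # ensure_ascii=False keeps Japanese characters readable in the output.
    return json.dumps(document, ensure_ascii=False, indent=2)


# Two entries copied from the output quoted in this thread.
sample = [
    {"start": 1.22, "end": 1.22, "text": "そ", "score": 0.0},
    {"start": 1.32, "end": 1.32, "text": "れ", "score": 0.0},
]

json_text = to_json_document("それ", sample)
print(json_text)
```

In the script from the first comment, this would be called as `to_json_document(text, word_timestamps)` and the returned string written to a file opened with `encoding="utf-8"`.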