echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.

Error: Token '4' not found in text #70


IMBAepsilon commented 2 months ago

When I run

echogarden align-transcript-and-translation 01.mp3 01.txt 01_translate.txt 01.json 01.srt

I get:

Echogarden v1.5.0

Start stage 1: Align speech to transcript
Transcode with command-line ffmpeg.. 1102.4ms
Convert wave buffer to raw audio.. 384.1ms
Resample audio to 16kHz mono.. 962.1ms
Crop using voice activity detection.. 1263.1ms
Normalize and trim audio.. 181.2ms
No language specified. Detect language using reference text.. 84.4ms
Language detected: Japanese (ja)
Load alignment module.. 0.2ms
Synthesize alignment reference with eSpeak.. 5911.2ms

Starting alignment pass 1/1: granularity: low, max window duration: 189s
Compute reference MFCC features.. 1069.2ms
Compute source MFCC features.. 721.3ms
DTW cost matrix memory size: 685.4MB
Align reference and source MFCC features using DTW.. 2345.1ms

Convert path to timeline.. 20.7ms
Postprocess timeline.. 54.9ms
Total alignment time: 14195.5ms

Start stage 2: Align timeline to translated transcript
No source language specified. Detect source language.. 0.9ms
Source language detected: Japanese (ja)
No target language specified. Detect target language.. 0.6ms
Target language detected: Chinese (zh)
Load e5 module
Prepare text for semantic alignment.. 331.4ms
Initialize E5 embedding model.. 1184.6ms
Extract embeddings from source 1.. Error: Token '4' not found in text
rotemdan commented 1 month ago

Thanks for the report.

align-transcript-and-translation is a complex operation that combines alignment engines with a special word embedding model.

Because of how the text is tokenized before being passed to the embedding model, there are likely various edge cases where tokenization and de-tokenization fail to round-trip back to the original text.
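
For illustration only, here is a minimal sketch (hypothetical names, not Echogarden's internals) of how a naive token-to-text matching step can produce exactly this kind of error when the tokenizer normalizes characters, for example mapping a full-width '４' to ASCII '4':

```typescript
// Hypothetical illustration -- not Echogarden's actual code.
// Maps each model token back to an offset in the original text by
// simple substring search. If the tokenizer normalized a character
// (e.g. full-width '４' -> ASCII '4'), the search cannot find the
// token in the unnormalized source text and throws an error similar
// to "Token '4' not found in text".
function mapTokensToOffsets(text: string, tokens: string[]): number[] {
	const offsets: number[] = []
	let cursor = 0

	for (const token of tokens) {
		const index = text.indexOf(token, cursor)

		if (index === -1) {
			throw new Error(`Token '${token}' not found in text`)
		}

		offsets.push(index)
		cursor = index + token.length
	}

	return offsets
}

// Example of the failure mode: the source contains a full-width digit,
// but the tokenizer emits its normalized ASCII form, so matching fails:
// mapTokensToOffsets('４月に会いましょう', ['4', '月', 'に', '会い', 'ましょう'])
// => Error: Token '4' not found in text
```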

I'll need the exact inputs used so I can reproduce the error and determine how to fix it.