Extract alignment from textgrid

roedoejet commented 1 month ago

PR Goal?

While writing the documentation for `everyvoice segment`, I realized that we didn't have an easy way of extracting the text/audio intervals from the TextGrid into the format needed by EveryVoice. This PR adds that feature and also updates the documentation.

Fixes?

https://github.com/EveryVoiceTTS/EveryVoice/issues/543 https://github.com/EveryVoiceTTS/EveryVoice/issues/544

Feedback sought?

Sanity. Suggest any changes to the CLI method names or documentation.

Priority?

medium

Tests added?

How to test?

To test this you need a plain-text transcript and the corresponding audio. First run the segmenter: `everyvoice segment align path_to_text.txt path_to_audio.wav`. You can then install Praat, use it to inspect the generated .TextGrid file, and adjust any alignments as necessary. Once you are happy with your alignments, run `everyvoice segment extract path_to_alignment.TextGrid path_to_audio.wav outdir`, which creates a folder called outdir containing your audio segments and a metadata file that references each audio file and its corresponding text.
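
For reviewers, here is a rough sketch of what the extraction step amounts to. It is not the code in this PR: `extract_segments` and the `(start_sec, end_sec, text)` interval tuples are hypothetical, the TextGrid is assumed to have been parsed already, and the input is assumed to be an uncompressed WAV file. The `wavs/segmentN.wav` plus `metadata.psv` layout follows what the command reports when it finishes.

```python
import wave
from pathlib import Path


def extract_segments(audio_path, intervals, outdir):
    """Cut labelled intervals out of a WAV file (illustrative helper only).

    intervals: iterable of (start_sec, end_sec, text) tuples, assumed to have
    been read from the TextGrid beforehand.
    Writes outdir/wavs/segmentN.wav and outdir/metadata.psv.
    """
    outdir = Path(outdir)
    wav_dir = outdir / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)  # also creates outdir if missing

    # Ignore intervals with no text (for example, silences between sentences).
    labelled = [iv for iv in intervals if iv[2].strip()]

    rows = ["basename|text"]
    with wave.open(str(audio_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end, text) in enumerate(labelled):
            first = int(start * rate)
            count = int(end * rate) - first
            src.setpos(first)
            frames = src.readframes(count)
            basename = f"segment{i}"
            with wave.open(str(wav_dir / f"{basename}.wav"), "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(frames)
            rows.append(f"{basename}|{text.strip()}")

    metadata = outdir / "metadata.psv"
    metadata.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return metadata
```

Numbering only the labelled intervals keeps the basenames contiguous, which makes the metadata file easier to check by eye.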

Confidence?

medium

Version change?

new alpha release

semanticdiff-com[bot] commented 1 month ago

Review changes with SemanticDiff.

Analyzed 2 of 4 files.

Overall, the semantic diff is 13% smaller than the GitHub diff.

| | Filename | Status |
|---|---|---|
| :heavy_check_mark: | everyvoice/cli.py | 13.44% smaller |
| :heavy_check_mark: | everyvoice/model/aligner/wav2vec2aligner | Analyzed |
| :grey_question: | docs/guides/custom.md | Unsupported file format |
| :grey_question: | docs/guides/finetune.md | Unsupported file format |

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Please upload report for BASE (main@3b20c2e). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #564   +/-   ##
=======================================
  Coverage        ?   76.07%
=======================================
  Files           ?       46
  Lines           ?     3386
  Branches        ?      460
=======================================
  Hits            ?     2576
  Misses          ?      707
  Partials        ?      103
```


github-actions[bot] commented 1 month ago

CLI load time: 0:00.30
Pull Request HEAD: 230b92f2b765e93a2fffec2a488ed241e297a20f
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

marctessier commented 1 month ago

I am running some tests with it right now :-)

One thing I noticed: the process is multithreaded and will use all available CPUs. Be careful when running on a cluster head node, since it can use up all the shared resources.

Below is a run in a 40-CPU container:

[U20-GPSC/etc/slurm-llnl/slurm]:$ top

top - 11:52:46 up 38 days, 23:52,  0 users,  load average: 37.38, 22.87, 17.36
Tasks:  15 total,   2 running,  13 sleeping,   0 stopped,   0 zombie
%Cpu(s): 97.8 us,  2.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 192032.2 total,  95793.2 free,  28507.4 used,  67731.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 155469.1 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                         
   395 tes001    20   0   10.6g   3.9g 112852 R  3991   2.1 124:45.39 everyvoice
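
One possible way to keep a run like this from taking every core is to cap the common thread-pool environment variables before launching the command. This is only a sketch: `capped_env` and `run_capped` are hypothetical helpers, and whether the aligner honours these variables is an assumption (the traceback suggests it runs on PyTorch, whose CPU backend generally reads `OMP_NUM_THREADS`, but that has not been verified here).

```python
import os
import subprocess

# Environment variables read by common numeric thread pools
# (OpenMP, Intel MKL, OpenBLAS).
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def capped_env(max_threads, base_env=None):
    """Return a copy of the environment with the thread-pool caps set."""
    env = dict(os.environ if base_env is None else base_env)
    for var in THREAD_VARS:
        env[var] = str(max_threads)
    return env


def run_capped(cmd, max_threads):
    """Run cmd (a list of arguments) with the capped environment."""
    return subprocess.run(cmd, env=capped_env(max_threads), check=True)
```

For example, `run_capped(["everyvoice", "segment", "align", "path_to_text.txt", "path_to_audio.wav"], max_threads=4)` would launch the segmenter with at most four threads per pool, if the assumption above holds. On a cluster, requesting a fixed number of CPUs from the scheduler is the more reliable control.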

I also ran into an issue when trying to run the alignment on a test dataset that I have.

I received the message below after it ran for about 8 minutes and died. It's a pretty big test file (~23 minutes of Inuktitut audio); I will try the same with something shorter and see if I get the same error.

============ Starting job 5002533 on Wed 09 Oct 2024 11:48:37 AM EDT on node ib12be-094.science.gc.ca OS "Ubuntu 20.04.6 LTS"
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/EveryVoice_extract/ever │
│ yvoice/model/aligner/wav2vec2aligner/aligner/cli.py:126 in align_single      │
│                                                                              │
│   123 │   print("performing alignment")                                      │
│   124 │   from .heavy import align_speech_file                               │
│   125 │                                                                      │
│ ❱ 126 │   characters, words, sentences, num_frames = align_speech_file(      │
│   127 │   │   wav, text_hash, model, labels, word_padding, sentence_padding  │
│   128 │   )                                                                  │
│   129 │   print("creating textgrid")                                         │
│                                                                              │
│ /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/EveryVoice_extract/ever │
│ yvoice/model/aligner/wav2vec2aligner/aligner/heavy.py:32 in                  │
│ align_speech_file                                                            │
│                                                                              │
│    29 │   audio, text_hash, model, labels_dictionary, word_padding, sentence │
│    30 ):                                                                     │
│    31 │   emission = get_emission(model, audio.to(DEVICE))                   │
│ ❱  32 │   segments, words, sentences = compute_alignments(                   │
│    33 │   │   text_hash,                                                     │
│    34 │   │   labels_dictionary,                                             │
│    35 │   │   emission,                                                      │
│                                                                              │
│ /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/EveryVoice_extract/ever │
│ yvoice/model/aligner/wav2vec2aligner/aligner/heavy.py:144 in                 │
│ compute_alignments                                                           │
│                                                                              │
│   141 │   │   end = None                                                     │
│   142 │   │   for w_k, w_v in word_hash.items():                             │
│   143 │   │   │   if sentence == re.match(key_pattern, w_k).group(1):        │
│ ❱ 144 │   │   │   │   scores.append(w_v.score)                               │
│   145 │   │   │   │   if start is None:                                      │
│   146 │   │   │   │   │   start = w_v.start                                  │
│   147 │   │   │   │   end = w_v.end                                          │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'dict' object has no attribute 'score'
============ Finished job 5002533 on Wed 09 Oct 2024 11:54:52 AM EDT with rc=1

marctessier commented 1 month ago

OK, I think I found a minor bug. I think the command `everyvoice segment extract` should create the OUTDIR folder if it does not exist. In the example below, my first try failed with the message "Invalid value for 'OUTDIR': Directory 'OUTPUT' does not exist." After I created the folder OUTPUT, it ran successfully. (Very cool!) Nice work, Aidan. I will run more tests with other datasets and combinations and check the results more closely!

(EveryVoice_extract) [U20-GPSC5]:$ everyvoice segment extract 1.Welcome-16000-16000-mono.TextGrid 1.Welcome-16000-16000-mono.mp3  OUTPUT 
Usage: everyvoice segment extract [OPTIONS] TEXT_GRID_PATH AUDIO_PATH OUTDIR
Try 'everyvoice segment extract -h' for help.
╭─ Error ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Invalid value for 'OUTDIR': Directory 'OUTPUT' does not exist.                                            │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(EveryVoice_extract) [U20-GPSC5]:$ mkdir OUTPUT
(EveryVoice_extract) [U20-GPSC5]:$ everyvoice segment extract 1.Welcome-16000-16000-mono.TextGrid 1.Welcome-16000-16000-mono.mp3  OUTPUT 
Writing audio to files: 100%|█████████████████████████████████████████████████| 6/6 [00:00<00:00, 974.93it/s]
Success! Your audio is available in /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/Extract_Alignment/Welcome/OUTPUT/wavs and your corresponding metadata file is available in /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/Extract_Alignment/Welcome/OUTPUT/metadata.psv

(EveryVoice_extract) [U20-GPSC5]:$ pwd
/home/tes001/u/TxT2SPEECH/Extract_Alignment/Welcome/OUTPUT
(EveryVoice_extract) [U20-GPSC5]:$ find .
.
./wavs
./wavs/segment0.wav
./wavs/segment4.wav
./wavs/segment2.wav
./wavs/segment5.wav
./wavs/segment3.wav
./wavs/segment1.wav
./metadata.psv
(EveryVoice_extract) [U20-GPSC5]:$ cat metadata.psv 
basename|text
segment0|ᑐᙵᓱᒋᑦ.
segment1|ᑐᙵᓱᑉᐳᖓ.
segment2|ᐃᓄᒃᑎᑑᓲᖑᕕᑦ?
segment3|ᐄ, ᒥᑭᔪᒥᒃ.
segment4|ᕇᑕᐅᔪᖓ. ᑭᓇᐅᕕᑦ?
segment5|ᑕᐃᕕᑎᐅᔪᖓ.
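
To check results like these more systematically, a small script can confirm that every row in the metadata file has a matching audio file. This is a sketch under assumptions: `missing_wavs` is a hypothetical helper, and it relies only on the layout shown above (a pipe-separated `metadata.psv` with a `basename` column, and one `wavs/<basename>.wav` per row).

```python
import csv
from pathlib import Path


def missing_wavs(outdir):
    """Return the basenames listed in metadata.psv that have no wav file."""
    outdir = Path(outdir)
    with open(outdir / "metadata.psv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps any quotation marks in the transcript text literal.
        reader = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        rows = list(reader)
    return [
        row["basename"]
        for row in rows
        if not (outdir / "wavs" / f"{row['basename']}.wav").is_file()
    ]
```

An empty list means every segment named in the metadata was written to disk.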

roedoejet commented 1 month ago

> I received the message below after it ran for about 8 minutes and died. It's a pretty big test file (~23 minutes of Inuktitut audio); I will try the same with something shorter and see if I get the same error.

Hm, yes, we need to spend a bit more time making the alignment more efficient and more robust. I think this is the same as the error @joanise described here: https://github.com/EveryVoiceTTS/EveryVoice/issues/327

> OK, I think I found a minor bug. I think the command `everyvoice segment extract` should create the OUTDIR folder if it does not exist. In the example below, my first try failed with the message "Invalid value for 'OUTDIR': Directory 'OUTPUT' does not exist." After I created the folder OUTPUT, it ran successfully. (Very cool!) Nice work, Aidan. I will run more tests with other datasets and combinations and check the results more closely!

Nice catch - thanks! I fixed this.

marctessier commented 1 month ago

I made sure to remove all numbers from my test. I have now also removed characters like " ? ( ) ! - " and am trying again to see if I get the same failure. I can get it to work on small chunks of that same file, so I am trying to hunt down exactly which block of my test file is causing the issue. I'll keep you posted if I can pinpoint it and reproduce it consistently.
