Specify word splitting?

rehandaphedar commented 4 weeks ago

Hi, I wonder if there is a way to restrict the possible words that can be output and/or pass in a list of words as input instead of a string, so that the exact word splitting is preserved.

Let us say I have an audio file, along with text already transcribed:

Mahmoud Ashraf owns a pizza shop.

And I already have that text broken into words in a specific way, and preserving that split is important:

Mahmoud Ashraf
owns
a
pizza shop

With the current model/config, Mahmoud, Ashraf, owns, a, pizza, and shop would be separate words. It doesn't seem possible to keep Mahmoud Ashraf and pizza shop as single words.

I understand this may be out of scope for this repo, so no problem at all if this is not implemented.

For context on why I want this feature: if you visit https://api.quran.com/api/v4/verses/by_key/1:1?words=true&audio=2&word_fields=text_uthmani, you can see the words (focus on text_uthmani). At the end, there are timings (the format is a bit odd) that expect the exact word splitting used above. I'm trying to build a wrapper program that automatically generates these timings (which do not have to be extremely accurate). ctc-forced-aligner mostly works well, but it sometimes splits words differently.

MahmoudAshraf97 commented 4 weeks ago

Hi, you can modify the preprocessing function to skip splitting and pass in pre-split input, or you can combine the split words afterwards if you know how.
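A minimal sketch of the second option (combining split words afterwards), in plain Python. It assumes the aligner's per-word output is a list of dicts with `text`, `start`, and `end` keys, and that the aligner split the text on whitespace; both are assumptions about the output shape, not something confirmed in this thread. The function name `merge_word_timings` and the timing values are hypothetical, made up for illustration.

```python
def merge_word_timings(word_timings, groups):
    """Merge per-word timings so they follow a caller-supplied grouping.

    word_timings: list of {"text", "start", "end"} dicts, one per aligned word
                  (assumed output shape, see note above).
    groups:       the desired "words", e.g. ["Mahmoud Ashraf", "owns", ...].
                  Each group is assumed to cover as many aligned words as it
                  has whitespace-separated tokens.
    """
    merged = []
    i = 0
    for group in groups:
        n = len(group.split())
        chunk = word_timings[i:i + n]
        if n == 0 or len(chunk) < n:
            raise ValueError(f"not enough aligned words for group {group!r}")
        # A merged group starts with its first word and ends with its last.
        merged.append({
            "text": group,
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        })
        i += n
    if i != len(word_timings):
        raise ValueError("groups do not cover all aligned words")
    return merged


# Illustrative (made-up) timings for "Mahmoud Ashraf owns a pizza shop."
word_timings = [
    {"text": "Mahmoud", "start": 0.0, "end": 0.4},
    {"text": "Ashraf", "start": 0.4, "end": 0.9},
    {"text": "owns", "start": 0.9, "end": 1.2},
    {"text": "a", "start": 1.2, "end": 1.3},
    {"text": "pizza", "start": 1.3, "end": 1.7},
    {"text": "shop", "start": 1.7, "end": 2.1},
]
groups = ["Mahmoud Ashraf", "owns", "a", "pizza shop"]
merged = merge_word_timings(word_timings, groups)
```

The merge relies only on token counts, so it breaks if the aligner splits a group into a different number of pieces than whitespace splitting would; in that case the pre-split route is likely the safer one.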

rehandaphedar commented 4 weeks ago

That worked wonderfully, even on a 128-word string with occasional Tajwīd/Waqf characters. Thanks a lot!

MahmoudAshraf97 / ctc-forced-aligner

Specify word splitting? #29