rehandaphedar closed this 4 weeks ago
Hi, you can modify the preprocessing function to skip splitting and pre-split the input yourself, or you can combine the split words afterwards if you know how.
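For the second option, a minimal sketch of recombining split words could look like the following. It is plain Python and does not call ctc-forced-aligner itself. The `Segment` class, the `merge_segments` helper, and the timing values are all hypothetical and only for illustration. It assumes the aligner split the text on whitespace, so each of your original words corresponds to a run of consecutive aligned words.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned unit: its text plus start/end time in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


def merge_segments(segments, groups):
    """Merge per-word segments so they follow the caller's own word splitting.

    Assumption: the aligner split on whitespace, so a group such as
    "pizza shop" maps to len("pizza shop".split()) consecutive segments.
    """
    merged = []
    i = 0
    for group in groups:
        n = len(group.split())
        chunk = segments[i:i + n]
        if len(chunk) < n:
            raise ValueError(f"not enough aligned words for group {group!r}")
        # A merged word starts with its first piece and ends with its last piece.
        merged.append(Segment(group, chunk[0].start, chunk[-1].end))
        i += n
    if i != len(segments):
        raise ValueError("aligned words left over after merging")
    return merged


# Made-up timings, only to show the shape of the data.
segments = [
    Segment("Mahmoud", 0.0, 0.4),
    Segment("Ashraf", 0.4, 0.9),
    Segment("owns", 0.9, 1.2),
    Segment("a", 1.2, 1.3),
    Segment("pizza", 1.3, 1.7),
    Segment("shop", 1.7, 2.1),
]
groups = ["Mahmoud Ashraf", "owns", "a", "pizza shop"]

merged = merge_segments(segments, groups)
for seg in merged:
    print(seg.text, seg.start, seg.end)
```

If the aligner's splitting differs from plain whitespace splitting (for example around Tajwīd/Waqf characters), the count per group would have to be computed with the same splitting rule the aligner uses.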
That worked wonderfully, even on a 128-word-long string with occasional Tajwīd/Waqf characters. Thanks a lot!
Hi, I wonder if there is a way to restrict the possible words that can be output and/or pass in a list of words as input instead of a string, so that the exact word splitting is preserved.
Let us say I have an audio file, along with text already transcribed:
And I already have that text broken into words in a specific way, preserving which is important:
With the current model/config, Mahmoud, Ashraf, owns, a, pizza, shop would be separate words. It doesn't seem possible to have "Mahmoud Ashraf" and "pizza shop" as single words.
I understand this may be out of scope for this repo, so no problem at all if this is not implemented.
For a better understanding of why I'm trying to get this feature: if you visit https://api.quran.com/api/v4/verses/by_key/1:1?words=true&audio=2&word_fields=text_uthmani, you can see there are words (focus on text_uthmani). At the end, there are timings (the format is a bit weird though) which expect the exact word splitting used above. I'm trying to build a wrapper program that automatically generates timings (which do not have to be extremely accurate). ctc-forced-aligner works mostly well; however, it sometimes splits words differently.