I have been testing MFA and CTC-segmentation-based forced alignment for English. Most open-source English CTC models are trained on characters, so using CTC-segmentation for forced alignment yields character and word boundaries. In some cases I found these word boundaries to be more accurate than MFA's word boundaries. However, since I need phoneme durations for a downstream TTS task, I could not make use of them directly.
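For reference, the word boundaries come from the usual `ctc_segmentation` package recipe, roughly like the sketch below. Here `lpz`, `char_list`, and `index_duration` are placeholders for whatever the character-level CTC model provides (the log-posterior matrix, its output vocabulary, and seconds of audio per CTC output frame):

```python
import ctc_segmentation as cs

def word_boundaries(lpz, words, char_list, index_duration):
    """lpz: (T, V) CTC log-posteriors; words: list of word strings."""
    config = cs.CtcSegmentationParameters()
    config.char_list = char_list            # model's output vocabulary
    config.index_duration = index_duration  # seconds per CTC frame
    # Treating each word as one "utterance" gives per-word start/end times.
    ground_truth_mat, utt_begin_indices = cs.prepare_text(config, words)
    timings, char_probs, _ = cs.ctc_segmentation(config, lpz, ground_truth_mat)
    segments = cs.determine_utterance_segments(
        config, utt_begin_indices, char_probs, timings, words)
    return [(w, start, end) for w, (start, end, conf) in zip(words, segments)]
```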
Is there a way, using the current toolkit, to get more accurate phoneme boundaries when accurate word boundaries are already known from another method? If this is not possible yet, what would need to change in the toolkit to support it?
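For what it's worth, the behaviour I have in mind could look roughly like the sketch below: slice the CTC posterior matrix at the known word boundaries and align each word's phoneme sequence inside its own slice, so phoneme boundaries can never cross a word boundary. This is only a hypothetical illustration, not a toolkit feature: it assumes a phoneme-level CTC model and a lexicon are available (which is exactly what most open-source character models lack), and all function and variable names here are mine.

```python
import numpy as np
import ctc_segmentation as cs

def phoneme_boundaries_within_words(lpz, words, word_frames, lexicon,
                                    phone_list, index_duration):
    """Hypothetical sketch: constrain phonemes to known word boundaries.

    lpz:         (T, V) log-posteriors from a phoneme-level CTC model (assumed).
    words:       list of word strings.
    word_frames: [(start_frame, end_frame)] per word, from the word aligner.
    lexicon:     dict mapping a word to its list of phoneme symbols (assumed).
    phone_list:  the CTC model's phoneme vocabulary (index -> symbol).
    """
    config = cs.CtcSegmentationParameters()
    config.char_list = phone_list
    config.index_duration = index_duration  # seconds per CTC frame
    results = []
    for word, (f0, f1) in zip(words, word_frames):
        phones = lexicon[word]
        # One single-token "utterance" per phoneme, so we get
        # per-phoneme start/end times back.
        token_ids = [np.array([phone_list.index(p)]) for p in phones]
        # Aligning each word only against its own frame slice means no
        # phoneme boundary can leak across the externally supplied
        # word boundary.
        gt_mat, utt_begins = cs.prepare_token_list(config, token_ids)
        timings, char_probs, _ = cs.ctc_segmentation(config, lpz[f0:f1], gt_mat)
        segments = cs.determine_utterance_segments(
            config, utt_begins, char_probs, timings, phones)
        offset = f0 * index_duration
        results.append([(p, offset + s, offset + e)
                        for p, (s, e, _) in zip(phones, segments)])
    return results
```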