MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Arabic G2P model #148

Open AhmedElsagher opened 4 years ago

AhmedElsagher commented 4 years ago

Hi, I'm interested in trying the pre-trained Arabic G2P model, but it only accepts Arabic encoded in romanized characters, and the encoding is the one done by GlobalPhone, as you mention in the documentation. I spent several hours searching for GlobalPhone's documentation for Arabic, but it seems I would need to buy the whole dataset. Is there a link to the encoding/romanization scheme, or something similar? I ask because there are several romanization standards for Arabic characters, so which one should I use? Can anyone help with that?

macriluke commented 4 years ago

The closest I've come to learning this so far is this paper:

The transcripts are available in the original orthographic script, but were additionally mapped into a romanized form. For Arabic only the romanized version is available; Tamil is not processed yet. For romanization, several tools were developed which vary from simple context-free mapping tools to more elaborated algorithms, like for the segmentation and pinyinzation of Chinese Hanzi. The romanized version of all transcripts is coded in ASCII-7.
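For illustration only, a "simple context-free mapping" of the kind the paper describes might look like the sketch below. The table is a small subset of the Buckwalter transliteration, used purely as a stand-in: it is not the GlobalPhone scheme, which nothing in this thread identifies, so its output should not be assumed to work with the MFA Arabic G2P model. The names `BUCKWALTER_SUBSET` and `romanize` are hypothetical.

```python
# Sketch of a context-free romanizer: each Arabic character maps to one
# ASCII character, independent of its neighbours.
# ASSUMPTION: the table is a subset of Buckwalter transliteration, chosen
# only as an example; it is NOT the (undocumented here) GlobalPhone mapping.
BUCKWALTER_SUBSET = {
    "\u0627": "A",  # alef
    "\u0628": "b",  # beh
    "\u062a": "t",  # teh
    "\u062b": "v",  # theh
    "\u062c": "j",  # jeem
    "\u062d": "H",  # hah
    "\u062e": "x",  # khah
    "\u062f": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u0634": "$",  # sheen
    "\u0639": "E",  # ain
    "\u0641": "f",  # feh
    "\u0642": "q",  # qaf
    "\u0643": "k",  # kaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0647": "h",  # heh
    "\u0648": "w",  # waw
    "\u064a": "y",  # yeh
}


def romanize(text: str, table: dict = BUCKWALTER_SUBSET) -> str:
    """Replace each character found in `table`; pass others through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # kaf + teh + beh
    print(romanize("\u0643\u062a\u0628"))
```

Because characters missing from the table pass through unchanged, the output is pure ASCII only when every input character is covered. A different scheme can be tried by passing another dictionary as `table`.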

Dallak commented 3 years ago

Are there any updates regarding the romanization system used in MFA? Nothing is available online, so any advice would be greatly appreciated.