g0v / ASR-no-Pangcah

Speech recognition for Pangcah
MIT License

Generate data/train #1

Closed sih4sing5hong5 closed 4 years ago

sih4sing5hong5 commented 4 years ago

@miaoski What are the first 5 entries of ami.json? I'll generate examples from them.

miaoski commented 4 years ago
{"examples": [{"description": "解釋1:太短的", "sentence": "O tadakamoko'ay kira keliw no miso.", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/tadakamoko'ay_{1}_@_1.1.mp3", "zh_Hant": "你的線太短了。"}], "name": "tadakamoko'ay", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/tadakamoko'ay_{1}.mp3", "frequency": "詞頻:★(1)", "source": "kamoko'"}
{"examples": [{"description": "解釋1:愚蠢", "sentence": "Pamolesa'en no mita saan ko maapaay to tamdaw to tamina i riyar.", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/maapaay_{1}_@_1.1.mp3", "zh_Hant": "在海中那個愚蠢的人說要讓船龜裂漏水。"}], "name": "maapaay", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/maapaay_{1}.mp3", "frequency": "詞頻:★(2)", "source": "apa"}
{"examples": [{"description": "解釋1:當成冰箱", "sentence": "kalapingsiyang han no Aysokimo a tamdaw ko soreda.", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/kalapingsiyang_{1}_@_1.1.mp3", "zh_Hant": "愛斯基摩人把冰塊當成冰箱。"}], "name": "kalapingsiyang", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/kalapingsiyang_{1}.mp3", "frequency": "詞頻:★(1)", "source": "pinsiyang"}
{"examples": [{"description": "解釋1:呼叫", "sentence": "Iyoyen ho ciira!", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/iyoyen_{1}_@_1.1.mp3", "zh_Hant": "呼叫他一下!"}], "name": "iyoyen", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/iyoyen_{1}.mp3", "frequency": "詞頻:★(1)", "source": "iyoy"}
{"examples": [{"description": "解釋1:驚嚇;水土不服(指嬰孩)", "sentence": null, "pronounce": null, "zh_Hant": null}, {"description": "備註", "sentence": null, "pronounce": null, "zh_Hant": null}], "name": "cahekit", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/cahekit_{1}.mp3", "frequency": "詞頻:★(0)", "source": null}
sih4sing5hong5 commented 4 years ago

Using the first entry:

{"examples": [{"description": "解釋1:太短的", "sentence": "O tadakamoko'ay kira keliw no miso.", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/tadakamoko'ay_{1}_@_1.1.mp3", "zh_Hant": "你的線太短了。"}], "name": "tadakamoko'ay", "pronounce": "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/tadakamoko'ay_{1}.mp3", "frequency": "詞頻:★(1)", "source": "kamoko'"}

data/train/text

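Drop the punctuation. Include both the words and the sentences. Whether to distinguish upper and lower case is up to you; either way is fine.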

Pangcah01 tadakamoko'ay
Pangcah02 O tadakamoko'ay kira keliw no miso
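
The rule above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the helper names `clean` and `text_lines` are made up here, the apostrophe is kept because it appears as a phoneme in the lexicon further down, and items without audio or text (the `null` fields in the fifth entry) are skipped.

```python
import re


def clean(sentence):
    """Drop punctuation but keep the apostrophe, then collapse whitespace."""
    no_punct = re.sub(r"[^\w\s']", " ", sentence)
    return " ".join(no_punct.split())


def text_lines(entries, prefix="Pangcah"):
    """Build data/train/text lines from ami.json entries (hypothetical helper).

    Each entry contributes its headword plus every example sentence,
    as long as both the text and its audio URL are present.
    """
    lines = []
    for entry in entries:
        items = [(entry["name"], entry["pronounce"])]
        items += [(ex["sentence"], ex["pronounce"]) for ex in entry["examples"]]
        for text, audio in items:
            if text and audio:  # skip null sentences / missing recordings
                lines.append(f"{prefix}{len(lines) + 1:02d} {clean(text)}")
    return lines
```

A full run over ami.json would need wider zero-padding than two digits so that the utterance IDs keep a consistent sort order.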

data/train/wav.scp

See https://github.com/i3thuan5/tai5-uan5_gian5-gi2_hok8-bu7/blob/master/臺灣言語服務/Kaldi語料匯出.py#L220

Pangcah01 ffmpeg -i "tadakamoko'ay_{1}.mp3" -f wav -ac 1 -ar 16000 pipe:1 | 
Pangcah02 ffmpeg -i "tadakamoko'ay_{1}_@_1.1.mp3" -f wav -ac 1 -ar 16000 pipe:1 | 
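
A minimal sketch of producing these lines from the `pronounce` URLs. The function name `wav_scp_line` is hypothetical, and it assumes the MP3s have already been downloaded into the working directory under the last path component of the URL. The double quotes follow the lines above, so this only works for file names without `"`, `$`, backticks, or backslashes.

```python
def wav_scp_line(utt_id, url):
    """Return one wav.scp line that decodes an MP3 to 16 kHz mono WAV on stdout."""
    # Assumption: the local file name is the last component of the URL.
    filename = url.rsplit("/", 1)[-1]
    return f'{utt_id} ffmpeg -i "{filename}" -f wav -ac 1 -ar 16000 pipe:1 |'


base = "https://e-dictionary.apc.gov.tw/MultiMedia/Audio/ami/"
print(wav_scp_line("Pangcah01", base + "tadakamoko'ay_{1}.mp3"))
print(wav_scp_line("Pangcah02", base + "tadakamoko'ay_{1}_@_1.1.mp3"))
```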

data/train/segments

Utterance ID, recording ID (the key in wav.scp), start time and end time in seconds:

Pangcah01 Pangcah01 0 xx
Pangcah02 Pangcah02 0 yy

data/train/utt2spk

If the utterances come from the same speaker, give them the same speaker ID.

Pangcah01 Pangcah
Pangcah02 Pangcah

Or treat every utterance as coming from a different speaker:

Pangcah01 Pangcah01
Pangcah02 Pangcah02
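
Both variants can be generated by one small function. A sketch only; the name `utt2spk_lines` and its `speaker` parameter are made up for illustration.

```python
def utt2spk_lines(utt_ids, speaker=None):
    """Map each utterance to a speaker.

    With speaker given, every utterance shares that speaker ID.
    With speaker=None, each utterance is treated as its own speaker.
    """
    return [f"{utt_id} {speaker or utt_id}" for utt_id in utt_ids]


ids = ["Pangcah01", "Pangcah02"]
print("\n".join(utt2spk_lines(ids, speaker="Pangcah")))  # same speaker
print("\n".join(utt2spk_lines(ids)))                     # one speaker per utterance
```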

data/local/dict/lexicon.txt

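Every word that appears in text must be listed here, followed by its phonemes. ng is a single phoneme and has to be written together (see the last example). Upper- and lower-case letters should map to the same phoneme (see the second example).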

tadakamoko'ay t a d a k a m o k o ' a y
O o
kira k i r a
keliw k e l i w
no n o
miso m i s o
nga'ay ng a ' a y
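
Since the spelling maps almost one-to-one onto phonemes, the lexicon can be derived from the text file. A rough sketch, assuming every `ng` in a word is the single phoneme and every other character (including `'`) is a phoneme of its own; `to_phonemes` and `lexicon_lines` are hypothetical names.

```python
import re

# "ng" is listed first so it wins over a lone "n"; any other character is one phoneme.
PHONEME = re.compile(r"ng|.")


def to_phonemes(word):
    """Split a word into phonemes, ignoring case."""
    return PHONEME.findall(word.lower())


def lexicon_lines(text_lines):
    """Collect every word in data/train/text and pair it with its phonemes."""
    words = set()
    for line in text_lines:
        words.update(line.split()[1:])  # first field is the utterance ID
    return [f"{word} {' '.join(to_phonemes(word))}" for word in sorted(words)]


print(" ".join(to_phonemes("nga'ay")))  # ng a ' a y
```

The word itself keeps its original case in the first column, so `O` and `o` would both need an entry if both occur in text.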

data/local/dict/nonsilence_phones.txt

Every phoneme that appears in the lexicon:

'
a
d
e
i
k
l
m
n
ng
o
r
s
t
w
y
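
This list can be collected from lexicon.txt directly. A small sketch (the function name `nonsilence_phones` is just for illustration):

```python
def nonsilence_phones(lexicon_lines):
    """Return the sorted set of phonemes used anywhere in the lexicon."""
    phones = set()
    for line in lexicon_lines:
        phones.update(line.split()[1:])  # first field is the word itself
    return sorted(phones)


lexicon = [
    "tadakamoko'ay t a d a k a m o k o ' a y",
    "O o",
    "kira k i r a",
    "keliw k e l i w",
    "no n o",
    "miso m i s o",
    "nga'ay ng a ' a y",
]
print("\n".join(nonsilence_phones(lexicon)))
```
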
sih4sing5hong5 commented 4 years ago

The detailed definitions are at https://kaldi-asr.org/doc/data_prep.html

miaoski commented 4 years ago

Can the MP3s be put on GitHub? Would that violate copyright?

miaoski commented 4 years ago

https://github.com/g0v/Pangcah-ASR/commit/cbdeef32c3b943dda6cc34500e8098dbcb7a1891

sih4sing5hong5 commented 4 years ago

Can the MP3s be put on GitHub? Would that violate copyright?

Government data is not a problem.

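https://law.moj.gov.tw/LawClass/LawAll.aspx?PCode=J0070017 Copyright Act, Article 50: Works publicly released in the name of a central or local government agency or a public juristic person may, within a reasonable scope, be reproduced, publicly broadcast, or publicly transmitted.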