CanCLID / to-jyutping

粵語拼音自動標註工具 Cantonese Pronunciation Automatic Labeling Tool
BSD 2-Clause "Simplified" License

Jyutping Improvement #4

Closed hockyy closed 1 month ago

hockyy commented 4 months ago

I don't know how you collect your Jyutping data, but words.hk publishes these lists:

https://words.hk/faiman/analysis/wordslist.json https://words.hk/faiman/analysis/charlist.json

If you haven't used this source yet, I think it's worth trying. I'm too lazy to write a new library, so I'll use your to-jyutping.

If you want to update the dictionary, you can parse all the words from there. For the tokenizer, we can use jieba:

https://github.com/hockyy/jieba-cantonese

I've written a script that auto-generates a jieba user dictionary for tokenization, so Jyutping can be looked up per token, which should give better results. If a token has no entry, fall back to per-character Jyutping.
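A minimal sketch of the lookup-with-fallback idea, to make the proposal concrete. Everything here is an assumption for illustration: `label_tokens`, `WORD_JYUTPING`, and `CHAR_JYUTPING` are hypothetical names, the dictionaries are toy data rather than the real words.hk lists, and the token list is hard-coded where the output of a tokenizer such as jieba would normally go.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy data; real entries would come from parsed word and character lists.
WORD_JYUTPING: Dict[str, str] = {
    "廣東話": "gwong2 dung1 waa2",
}
CHAR_JYUTPING: Dict[str, str] = {
    "我": "ngo5",
    "講": "gong2",
    "廣": "gwong2",
    "東": "dung1",
    "話": "waa6",
}


def label_tokens(
    tokens: Iterable[str],
    word_dict: Dict[str, str],
    char_dict: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair each token with a reading, preferring the word-level entry."""
    labelled = []
    for token in tokens:
        reading = word_dict.get(token)
        if reading is None:
            # No word-level entry: fall back to one reading per character.
            # Characters missing from char_dict are passed through unchanged.
            reading = " ".join(char_dict.get(ch, ch) for ch in token)
        labelled.append((token, reading))
    return labelled


# Tokens are hard-coded here; in practice they would come from the tokenizer.
tokens = ["我", "講", "廣東話"]
for token, reading in label_tokens(tokens, WORD_JYUTPING, CHAR_JYUTPING):
    print(token, reading)
```

The word-level entry matters for cases like 話, whose tone in 廣東話 differs from its per-character reading in this toy data, which is why the word dictionary is consulted first.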

hockyy commented 4 months ago

Let me know if you need any help.

I'm currently developing this project: https://github.com/hockyy/miteiru

laubonghaudoi commented 4 months ago

@graphemecluster As far as I know, the words.hk (粵典) data has been in use from early on? And the current update mainly uses Jon's font data?

graphemecluster commented 4 months ago

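Right now we only use Jon's data, but it is certainly more accurate than jieba segmentation. @chaaklau Do you think your words.hk word list would be useful for Jyutping labelling?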

graphemecluster commented 4 months ago

@hockyy The accuracy should exceed 99% as of our latest updates (JS/TS version 2.0.0 / Python version 0.3.0), released a few days ago.

hockyy commented 4 months ago

Ack, OK, thanks for the info.

hockyy commented 4 months ago

[image]

By the way, this one can't be imported.

hockyy commented 4 months ago

I'll debug it tomorrow, I'm really sleepy 😪