hockyy closed this 1 month ago
Let me know if you need any help.
I'm currently developing this project https://github.com/hockyy/miteiru
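@graphemecluster As far as I know, the words.hk (粵典) data has been in use from early on? Is the current update mainly about using Jon's font data?
Right now we only use Jon's data, but it's definitely more accurate than jieba segmentation. @chaaklau Do you think your words.hk word list would be useful for annotating Jyutping?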
@hockyy The accuracy should reach more than 99% since our latest updates (JS/TS version 2.0.0 / Python version 0.3.0) a few days ago.
Ack, OK, thank you for the info.
btw, this can't be imported
I'll debug it tomorrow, I'm really sleepy 😪
I don't know how you source your Jyutping data, but anyway, if you haven't included this method yet, I think you can try these two files from words.hk:
https://words.hk/faiman/analysis/wordslist.json
https://words.hk/faiman/analysis/charlist.json
I'm too lazy to write a new library, so I'll use your to-jyutping. If you want to update the dictionary, you can parse all the words from there. For the tokenizer, we can use jieba:
https://github.com/hockyy/jieba-cantonese
I've made a script that auto-generates a jieba user dict for tokenizing, so Jyutping can be queried per token, which should give better results. If a token has no entry, fall back to per-character Jyutping.
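To make the idea concrete, here's a rough sketch of the per-token lookup with a per-character fallback. It is not the actual script from jieba-cantonese and not how to-jyutping works internally. The two dicts are toy data, the names `token_to_jyutping` and `annotate` are made up for illustration, and the tokens are assumed to come pre-segmented from a tokenizer such as jieba.

```python
from typing import Dict, List, Tuple

# Toy data for illustration only. A real setup would load a word-level table
# and a character-level table (assumed here to map text to Jyutping strings).
WORD_JYUTPING: Dict[str, str] = {"銀行": "ngan4 hong4"}
CHAR_JYUTPING: Dict[str, str] = {
    "我": "ngo5",
    "去": "heoi3",
    "銀": "ngan4",
    "行": "haang4",
}


def token_to_jyutping(token: str, words: Dict[str, str], chars: Dict[str, str]) -> str:
    """Look the whole token up first, then fall back to per-character readings."""
    if token in words:
        return words[token]
    # Characters without a known reading are passed through unchanged.
    return " ".join(chars.get(ch, ch) for ch in token)


def annotate(
    tokens: List[str], words: Dict[str, str], chars: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair each token with its Jyutping reading."""
    return [(t, token_to_jyutping(t, words, chars)) for t in tokens]


if __name__ == "__main__":
    # Tokens are assumed to be already segmented by a tokenizer.
    tokens = ["我", "去", "銀行"]
    for token, reading in annotate(tokens, WORD_JYUTPING, CHAR_JYUTPING):
        print(token, reading)
```

The word-level lookup is what picks the right reading for polyphonic characters: in this toy data 行 alone falls back to haang4, while inside 銀行 it is read hong4.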