danielt998 / HanziToAnki

This is a program that takes a Chinese text as input and converts it to an Anki Deck
MIT License
22 stars 0 forks

Do better segmentation #217

Open danielt998 opened 2 weeks ago

danielt998 commented 2 weeks ago

We should take inspiration from https://github.com/fxsjy/jieba, either by finding a Java library that does the same thing or by writing our own implementation. My understanding is that it builds a DAG of all the possible ways to segment a sentence/clause, then uses word frequencies to calculate a probability for each segmentation.
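For reference, a rough sketch of the DAG-plus-frequency idea as described above. This is not jieba's code and not anything in HanziToAnki: the class name `DagSegmenter`, the caller-supplied frequency map, and the choice to treat unknown single characters as having a count of 1 are all assumptions made for illustration. It indexes by Java `char`, so characters outside the BMP would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of DAG-based segmentation; names and data are illustrative only. */
class DagSegmenter {
    private final Map<String, Integer> freq; // word -> count, supplied by the caller
    private final double logTotal;           // log of the sum of all counts
    private final int maxWordLength;         // longest dictionary word, bounds the lookup

    public DagSegmenter(Map<String, Integer> freq) {
        this.freq = freq;
        long total = 0;
        int maxLen = 1;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            total += e.getValue();
            maxLen = Math.max(maxLen, e.getKey().length());
        }
        this.logTotal = Math.log(Math.max(total, 1));
        this.maxWordLength = maxLen;
    }

    /** For each start index, collect the exclusive end indices of candidate words. */
    private List<List<Integer>> buildDag(String s) {
        List<List<Integer>> dag = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            List<Integer> ends = new ArrayList<>();
            ends.add(i + 1); // a single character is always an edge, so a path always exists
            int limit = Math.min(s.length(), i + maxWordLength);
            for (int j = i + 2; j <= limit; j++) {
                if (freq.containsKey(s.substring(i, j))) {
                    ends.add(j);
                }
            }
            dag.add(ends);
        }
        return dag;
    }

    /** Pick the path through the DAG with the highest summed log-probability. */
    public List<String> segment(String s) {
        int n = s.length();
        List<List<Integer>> dag = buildDag(s);
        double[] best = new double[n + 1]; // best[i] = best score for the suffix starting at i
        int[] next = new int[n + 1];       // next[i] = end index of the word chosen at i
        for (int i = n - 1; i >= 0; i--) {
            best[i] = Double.NEGATIVE_INFINITY;
            for (int j : dag.get(i)) {
                // Assumption: words missing from the map (or with count < 1) count as 1.
                int count = Math.max(1, freq.getOrDefault(s.substring(i, j), 1));
                double score = Math.log(count) - logTotal + best[j];
                if (score > best[i]) {
                    best[i] = score;
                    next[i] = j;
                }
            }
        }
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i = next[i]) {
            words.add(s.substring(i, next[i]));
        }
        return words;
    }
}
```

With made-up counts 研究=100, 研究生=20, 生命=80, 命=10, 起源=30, calling `segment("研究生命起源")` picks 研究 / 生命 / 起源 over 研究生 / 命 / 起源, because the first path has the higher combined probability. A real implementation would need an actual frequency dictionary and some way of handling unknown words.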

danielt998 commented 2 weeks ago

A Java equivalent does exist on Maven: https://mvnrepository.com/artifact/com.huaban/jieba-analysis (source: https://github.com/huaban/jieba-analysis)

It does look unmaintained though. I don't know whether it will need updating to work with newer Java versions.