FreeLanguageTools / vocabsieve

Simple sentence mining tool for language learning
GNU General Public License v3.0

Better Chinese support #10

Status: Open · Madwonk opened this issue 2 years ago

Madwonk commented 2 years ago

Since Chinese, unlike English, doesn't separate words or names with spaces, double-clicking a word simply selects the entire sentence.

The Firefox/Chrome extension Zhongwen and the Android app Pleco are examples of software that automatically detect word boundaries using a dictionary (CC-CEDICT is bundled in Zhongwen's case, and comes in at only 3.6 MB zipped).

It would be advantageous to integrate CC-CEDICT as a dictionary option for Chinese, and to leverage it to help select words in a Chinese sentence. I'm willing to contribute some code to help do this, but I'd like some input from the primary developer before doing so.
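Dictionary-backed word selection as proposed above could be sketched with forward maximum matching: at each position, greedily take the longest dictionary entry, falling back to a single character. This is a minimal illustration, not vocabsieve code; the tiny inline dictionary stands in for CC-CEDICT.

```python
def segment(sentence: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedily match the longest dictionary entry at each position;
    fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

demo_dict = {"我们", "喜欢", "学习", "中文"}
print(segment("我们喜欢学习中文", demo_dict))
# → ['我们', '喜欢', '学习', '中文']
```

A double-click handler could then map the click offset to the matched word instead of the whole sentence. The greedy approach fails on some ambiguous strings, which is why statistical segmenters exist.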

1over137 commented 2 years ago

I originally wanted to implement such a feature, but I couldn't quite afford the time and maintenance burden needed to implement it. At one point I even implemented a Japanese parser, but Yomichan does a much better job, and it added too much in the way of dependencies. I didn't know of a dictionary-based way of splitting words before; for Chinese this can be pretty useful. If you are willing to contribute code to make this happen, feel free to do so! I would be glad to accept a patch/PR for this. Here are a few points to keep in mind:

If you have any questions about the architecture of the program, feel free to ask!

1over137 commented 2 years ago

Also, I was actually considering parsing the sentences with something like jieba. That uses a more sophisticated algorithm to split words and may work for words not covered by the dictionary (e.g. proper names). In addition, I can also implement support for CC-CEDICT as a dictionary format.
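Supporting CC-CEDICT as a format mostly means parsing its line syntax, which is `traditional simplified [pin1 yin1] /definition/definition/`, with `#`-prefixed header lines. A hedged sketch of such a parser (not vocabsieve's actual importer):

```python
import re

# One CC-CEDICT entry per line, e.g.:
#   中國 中国 [Zhong1 guo2] /China/Middle Kingdom/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

def parse_cedict_line(line: str):
    """Return (traditional, simplified, pinyin, definitions), or None
    for comment lines and lines that don't match the format."""
    if line.startswith("#"):
        return None
    m = CEDICT_LINE.match(line.strip())
    if not m:
        return None
    trad, simp, pinyin, defs = m.groups()
    return trad, simp, pinyin, defs.split("/")

print(parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
# → ('中國', '中国', 'Zhong1 guo2', ['China', 'Middle Kingdom'])
```

The simplified-form column could feed both the dictionary lookup and the word-segmentation wordlist discussed above.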

Ceynou commented 2 years ago

I barely know a thing about programming, let alone coding, but you're talking about using spaCy, right? It supports 64 languages, so I guess that would work for all the other languages vocabsieve supports.

1over137 commented 2 years ago

> I barely know a thing about programming, let alone coding, but couldn't you use spaCy instead of jieba, it supports 64 languages

I am using simplemma for lemmatization, which is essentially a big text-lookup table. There is no current need for spaCy in this project: it's a fairly big and complicated framework for many tasks (natural language understanding, tagging, classification, etc.) and requires downloading model resources at runtime.
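The "big text lookup" approach described here can be illustrated with a plain form-to-lemma table and a fallback to the surface form. This is a toy sketch of the idea, not simplemma's actual data or API:

```python
# Illustrative form → lemma table; real lookup lemmatizers ship
# large per-language tables like this instead of trained models.
LEMMA_TABLE = {
    "went": "go",
    "children": "child",
    "studies": "study",
}

def lemmatize(word: str, table: dict[str, str]) -> str:
    """Return the table's lemma, or the word itself if unknown."""
    return table.get(word.lower(), word)

print(lemmatize("Went", LEMMA_TABLE))    # → go
print(lemmatize("tables", LEMMA_TABLE))  # → tables (unknown, passed through)
```

The appeal is the dependency footprint: a dictionary lookup needs no runtime downloads, which fits the constraint mentioned above.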

GrimPixel commented 2 years ago

In fact, not only Chinese but also Japanese and Korean have this problem. In Vietnamese, spaces separate syllables; in Thai and Lao, spaces separate sentences. There is a list of such tools: https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Text-Processing-Tools#Word_Segmentation

1over137 commented 2 years ago

@GrimPixel @BenMueller So, is either of you actually going to implement this?

GrimPixel commented 2 years ago

I only knew about tools for word segmentation and saw that you needed them; I have no experience using them.

parthshahp commented 5 months ago

Is this something that still needs work? Has there been any progress in the last few years? I'm happy to take a look at it if it's needed.