Support more word segmentation tools

LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading. Python/Flask.

MIT License


Support more word segmentation tools #218

Open GrimPixel opened 5 months ago

GrimPixel commented 5 months ago

Is your feature request related to a problem? Please describe.

Only MeCab is supported for word segmentation.

Describe the solution you'd like

Add support for those mentioned at https://polyglotclub.com/wiki/Language/Multiple-languages/Culture/Text-Processing-Tools#Word_Segmentation.

jzohrab commented 5 months ago

Thanks very much, this is how things really should be done: Lute should just be for reading, and segmentation/tokenization should be handled outside of it. The link you gave is very useful, appreciated.

The problem I run into is how to do a "plugin architecture" (#116), as different users/languages will have different requirements. That may be very easy to do, but it may also be brutal! I don't have a good handle on it yet.

jzohrab commented 2 weeks ago

The plugin architecture issue of #116 is now done. So if anyone wants to hack on adding new parsers/segmenters, there are notes in the wiki. :-)
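
For readers wondering what "handling segmentation outside of the reader" could look like in general, here is a minimal sketch in Python. It is not Lute's actual plugin API (the wiki notes are the reference for that); the names `Segmenter`, `WhitespaceSegmenter`, `REGISTRY`, and `register` are all hypothetical and only illustrate the idea of swapping segmenters behind one small interface.

```python
from abc import ABC, abstractmethod
import re


class Segmenter(ABC):
    """Hypothetical interface: turn raw text into a list of tokens."""

    name: str

    @abstractmethod
    def segment(self, text: str) -> list[str]:
        ...


# Hypothetical registry mapping a segmenter name to an instance.
REGISTRY: dict[str, Segmenter] = {}


def register(segmenter: Segmenter) -> None:
    """Make a segmenter available under its name."""
    REGISTRY[segmenter.name] = segmenter


class WhitespaceSegmenter(Segmenter):
    """Toy segmenter: runs of word characters, plus single punctuation marks."""

    name = "whitespace"

    def segment(self, text: str) -> list[str]:
        return re.findall(r"\w+|[^\w\s]", text)


register(WhitespaceSegmenter())

# The reading application only needs the name, not the implementation.
tokens = REGISTRY["whitespace"].segment("Hello, world!")
print(tokens)
```

A wrapper around MeCab or any other tool from the linked list would, under this assumed design, be another `Segmenter` subclass registered under its own name.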