Support for Korean parsing

LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading.

https://luteorg.github.io/lute-manual/

MIT License

Support for Korean parsing #11

Open rocky1638 opened 1 year ago

rocky1638 commented 1 year ago

Is your feature request related to a problem? Please describe.

I'd like for there to be a way to parse Korean texts as I'm learning Korean.

Describe the solution you'd like

Implement a Korean parser based on MeCab-Ko.

Describe alternatives you've considered

I tried to use MeCab to parse a Korean text, but it didn't work, even though MeCab and MeCab-Ko seem to have similarities based on my online research.

(I was using \p{Hangul} as the Regex for character matching, but I'm not sure if that's correct either so that could have been the issue.)

jzohrab commented 1 year ago

Hi there, thanks for the message. Unfortunately I have no idea how to implement this effectively. Korean is known to be tough to parse.

Japanese has MeCab, which can be installed from the command line on every OS I've checked (see https://github.com/jzohrab/lute/wiki/Installing-dependencies#mecab for installation notes). Is there anything similar for Korean? I found https://github.com/shirakaba/mecab-ko but that appears to be iOS (mac) only. If install instructions like Lute's existing MeCab ones were available for MeCab-Ko, this should be feasible.

There are python libraries that might work, but I can't see how to make them work for Lute (in its current state) easily.

jzohrab commented 12 months ago

@rocky1638 made changes in Lute v2 for Korean, would be nice to add them to v3. Message from him/her in Discord:

... finally ended up pushing the changes for korean up to my fork here: https://github.com/rocky1638/lute-ko ... hand the baton off to you! Wrote some context at the top of the README for known bugs and what I added.

Not sure how tough it will be to port this over -- architecturally, most things are still the same, but the Japanese parser uses natto-py and has a MECAB_PATH user setting. Maybe the ko mecab has something similar, not sure.

emanuelps2708 commented 11 months ago

I found a Korean parser that might work: https://github.com/konlpy/konlpy https://konlpy.org/en/latest/

emanuelps2708 commented 11 months ago

This one says it's cross-platform (Linux, Windows, and Mac), and it's written in Python, which might make it easier to integrate into the main Lute.

jzohrab commented 11 months ago

Thanks @emanuelps2708 for the note. The code appears to also use Java, which is interesting. That's not a dealbreaker, but it might be slow (?), and it might make installs complicated for some people. For Docker, it would likely be fine, just means more stuff in the image. I'm also not sure how that project loads dictionaries, more testing needed!

emanuelps2708 commented 11 months ago

Thanks for replying and for your patience, and sorry for requesting so many things. As for the dictionaries, it has different options: Hannanum, Kkma, and Mecab-ko. Mecab appears to be the fastest option, followed by Hannanum: https://konlpy.org/en/latest/morph/ https://konlpy.org/en/latest/data/

Finally, I don't know how to do the testing myself, but if you need someone I'll be more than happy to help as much as I can. Maybe this can help: https://konlpy.org/en/latest/test/ Thanks again @jzohrab ;)
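To make the KoNLPy idea a little more concrete, here is a rough sketch of how a tagger could be wrapped for testing. It assumes only that the tagger object exposes a `pos(text)` method returning `(surface, tag)` pairs, as KoNLPy's tagger classes do. `PosTagger`, `WhitespaceTagger`, and `parse_tokens` are hypothetical names, not part of Lute or KoNLPy, and the stand-in tagger exists only so the sketch runs without Java or MeCab-Ko installed.

```python
from typing import List, Protocol, Tuple


class PosTagger(Protocol):
    """Anything with a KoNLPy-style pos() method (hypothetical interface name)."""

    def pos(self, phrase: str) -> List[Tuple[str, str]]: ...


class WhitespaceTagger:
    """Stand-in tagger for experiments: splits on whitespace, tags everything UNKNOWN."""

    def pos(self, phrase: str) -> List[Tuple[str, str]]:
        return [(word, "UNKNOWN") for word in phrase.split()]


def parse_tokens(tagger: PosTagger, text: str) -> List[str]:
    """Return the surface form of every token the tagger produces."""
    return [surface for surface, _tag in tagger.pos(text)]


print(parse_tokens(WhitespaceTagger(), "저는 한국어를 공부해요"))
# ['저는', '한국어를', '공부해요']

# Assumption: with KoNLPy and a MeCab-Ko backend installed, a real tagger
# could be swapped in without changing parse_tokens:
#   from konlpy.tag import Mecab
#   parse_tokens(Mecab(), "저는 한국어를 공부해요")
```

Keeping the tagger behind a small interface like this would let the speed and install trade-offs between backends be compared without touching the rest of the parsing code.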