LuteOrg / lute-v3

LUTE = Learning Using Texts: learn languages through reading.
https://luteorg.github.io/lute-manual/
MIT License

Importing Japanese Terms adds zero-width char to some terms #396

Open eujev opened 7 months ago

eujev commented 7 months ago

Description

When importing Terms from a .csv file for Japanese, zero-width chars are added to some terms, which results in these words not being parsed or recognized correctly in the text. (Terms were originally from LWT.) Maybe related to #371? Or are the terms parsed by Mecab when they are imported? Importing Japanese Terms also seemed to take longer than for other languages.

To Reproduce

Steps to reproduce the behavior, e.g.:

  1. Import terms from a .csv file for Japanese (Example file included) Japanese_example_terms.csv

  2. Create new book (for example with the following example text) Japanese_example_text.txt

  3. Hover over blue words (like 年生 or 日間)

  4. See error: the text contains 年生, but the term in the database has a zero-width char (年​生), so hovering shows the term with the zero-width char plus the individual Kanji.
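
A quick way to see the character-level mismatch outside of Lute (plain Python, just illustrating the strings; the variable names are mine):

ZWS = "\u200b"
term_in_db = "年" + ZWS + "生"    # what ended up stored after the import
token_in_text = "年生"            # what the text actually contains

print(repr(term_in_db))           # '年\u200b生'
print(repr(token_in_text))        # '年生'
print(term_in_db == token_in_text)                    # False, so the term is not recognized
print(term_in_db.replace(ZWS, "") == token_in_text)   # True once the ZWS is stripped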

Screenshots

(screenshot: lute)


jzohrab commented 7 months ago

Hi @eujev , thanks for the issue.

Lute parses each term when imported. Unfortunately for some languages like Japanese, the parsing depends on context. For some terms like the ones you mentioned, this means that the terms will be different when parsed in full texts.

I’m not sure what the best approach is here; open to suggestions. As a stopgap, we could do a db data fix. I know that’s not optimal!
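
To make "depends on context" concrete, here's a rough sketch that calls mecab directly via natto-py (not Lute's actual code; you need a working MeCab install, and the exact splits depend on your dictionary):

from natto import MeCab

nm = MeCab()

def tokens(text):
    # surface forms of mecab's tokens, skipping the end-of-sentence node
    return [n.surface for n in nm.parse(text, as_nodes=True) if not n.is_eos()]

print(tokens("年生"))            # the term parsed on its own
print(tokens("彼は三年生です"))   # the same kanji inside a sentence may split differently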

eujev commented 7 months ago

Hey @jzohrab, thanks for the quick reply!

I guess the zero-width characters are needed to show the individual parts of a larger Kanji word, or are they also used for something else? I'm also not sure what the best solution could be. By a db data fix, do you mean removing the zero-width characters, or something else?

If it is not advisable to get rid of the zero-width characters, maybe another option is to be able to create an alias for some words, because only a few of the words are not correctly recognized in the text (青森​県, for example, is recognized as known in the example). So for 年​生 (Unicode \u5e74\u200b\u751f) you could create an alias saying that 年生 (\u5e74\u751f) is the same word. Edit: This could also help with words that have multiple spellings in Japanese, like 引き受ける - 引受ける - 引きうける - 引受る. I guess that would require a db change though. In theory you could also use the parent word for that, but then it would create an extra word and could clutter the popup window. What do you think?
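
Roughly what I'm imagining for the alias idea, as a toy sketch (none of these names exist in Lute; it's only to show the lookup, and the translations are stand-ins):

ZWS = "\u200b"

def normalize(term):
    # strip zero-width spaces so 年\u200b生 and 年生 compare equal
    return term.replace(ZWS, "")

# stand-in for the stored terms: stored text -> translation
terms = {
    "年" + ZWS + "生": "...th-year student",
    "引き受ける": "to take on, to undertake",
}

# alias as it appears in a text -> canonical stored form
aliases = {normalize(stored): stored for stored in terms}
aliases["引受ける"] = "引き受ける"   # alternative spelling pointing at the same term

def lookup(token):
    stored = aliases.get(normalize(token), token)
    return terms.get(stored)

print(lookup("年生"))      # found via the ZWS-stripped alias
print(lookup("引受ける"))  # found via the spelling alias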

Greetings

jzohrab commented 7 months ago

The zero width spaces are to delimit parsed tokens. I had a wiki page about that. In summary, the ZWS characters vastly simplify pattern matching.

A db fix for these words would just be removing the ZWS chars from your words in the db, or creating separate aliases as you mentioned. There’s already an issue for that, but that doesn’t solve your current problem.

Can you live with your data as it is now? Unfortunately I’m travelling and am really strapped for time, but if you really need a fix for some reason, let me know; I could do something like write a query for your particular case. Hacky, but I don’t have time! Cheers and thanks for the thoughts above.
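
Roughly the kind of thing I mean by a query, as an untested sketch. Table and column names here (words, WoID, WoText, WoTextLC) are assumed from the LWT-style schema, so verify them against your own database, and back the db up before trying anything like this:

import sqlite3

ZWS = "\u200b"

con = sqlite3.connect("lute.db")   # path to (a backup copy of) your Lute database
rows = con.execute(
    "SELECT WoID, WoText FROM words WHERE WoText LIKE ?", (f"%{ZWS}%",)
).fetchall()

for wo_id, text in rows:
    cleaned = text.replace(ZWS, "")
    # if the cleaned form already exists as its own term, this could hit a
    # uniqueness constraint; a real fix would merge or skip those rows
    con.execute(
        "UPDATE words SET WoText = ?, WoTextLC = ? WHERE WoID = ?",
        (cleaned, cleaned.lower(), wo_id),
    )

con.commit()
con.close()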

eujev commented 7 months ago

Okay got it. No worries, I will live with the data as it is for now or check how I can create aliases if I find time. Enjoy your travels and cheers!

andypeeters commented 6 months ago

Hi all.

I have a follow-up question/proposal regarding the zero width spaces.

As I currently understand, every time a Japanese text or term is imported/added, the text is first parsed via mecab. It is mecab that adds those zero width spaces, correct? But as we all know, mecab's parsing is not perfect. It isn't a big problem, but because of the zero width spaces words aren't always shown correctly, which can be a little annoying from time to time.

Therefore I have a small proposal as an extra option. Let's assume a person already has a kind of "tokenized" Japanese text, meaning all words are already split with real white spaces between them (not zero width spaces!). This is sometimes done in language textbooks to teach students the word boundaries, or the user has already used another tool to do the same thing.

During the import of such a text, the user could select a check box telling Lute to bypass the mecab parser, tokenize the text just on the real spaces (potentially replacing them with zero width spaces or something else), and display that parsed text to the user for reading and term recognition. The state of the check box could be stored in the database so that the same steps are used when the text is edited.
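
As a standalone sketch of what I mean (made-up function names, nothing Lute-specific):

import re

ZWS = "\u200b"

def tokenize_prespaced(text):
    # split only on the real whitespace the user already put in; no morphological analysis
    return [t for t in re.split(r"\s+", text) if t]

def render_for_lute(text):
    # rejoin the user's own word boundaries with ZWS (or whatever the pipeline needs)
    return ZWS.join(tokenize_prespaced(text))

sample = "私 は 三 年生 です 。"
print(tokenize_prespaced(sample))         # ['私', 'は', '三', '年生', 'です', '。']
print(repr(render_for_lute(sample)))      # '私\u200bは\u200b三\u200b年生\u200bです\u200b。'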

Would implementing something like this be feasible and desirable as a feature? I might be wrong, but it appeared to me that both LWT and LingQ allowed fixing a text in this way when the word boundaries were not correctly recognized, at least in the past.

jzohrab commented 5 months ago

Hi @andypeeters - thanks for the note.

As I currently understand, every time a Japanese text or term is imported/added, the text is first parsed via mecab. It is mecab that adds those zero width spaces, correct?

What happens: mecab parses the text into smaller part-of-speech tokens, and Lute joins these tokens with zero-width spaces for pattern matching. E.g., here's mecab parsing:

私は行っています。
私   名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
行っ  動詞,自立,*,*,五段・ワ行促音便,連用タ接続,行う,オコナッ,オコナッ
て   助詞,接続助詞,*,*,*,*,て,テ,テ
い   動詞,非自立,*,*,一段,連用形,いる,イ,イ
ます  助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。   記号,句点,*,*,*,*,。,。,。

When written onscreen, each parsed token is displayed with the zero-width space in between, i.e. 私/は/行っ/て/い/ます, and if you highlight these tokens to join them into a word, that ZWS is also stored with the word.
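
A toy illustration of why that ZWS joining makes matching simple (plain Python, not the actual Lute code):

ZWS = "\u200b"

tokens = ["私", "は", "行っ", "て", "い", "ます", "。"]
rendered = ZWS.join(tokens)        # effectively what sits behind the reading screen

# a "word" created by highlighting two adjacent tokens keeps the ZWS inside it
saved_term = "行っ" + ZWS + "て"

print(saved_term in rendered)      # True: a plain substring search finds the multi-token term
print("行って" in rendered)         # False: without the ZWS the same characters don't match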

Re your suggestion (thanks for the notes!): it's the right suggestion for a project, but I'm not sure how to do it in Lute without making things more complicated in general. Importing is handled by a single, language-agnostic routine. Having Japanese/mecab-specific settings on an import page doesn't make sense for non-jp users (of course), and then it opens up the same questions for other parsers like jieba (mandarin) or mecab-ko (korean). It's feasible to do, because this is software and there are usually nice ways around everything, but it gets tough, and it only impacts a subset of users.

I know it's not ideal, but at the moment, given the backlog of things that really should be done that will benefit everyone, I can't invest time in this particular thing.

Despite all of the excuses above, this is still a good suggestion from you. Lute recently introduced the idea of "parser plug-ins", so more languages with this particular need might be supported. If so, then having "parser-specific import handlers" might have more of a payoff, and your idea would have more ROI.