epistularum / hunspell-ja-deinflection

Hunspell dictionary to deinflect all Japanese conjugated verbs to the dictionary form and suggest correct spelling.

Suggestions about improve forecast accuracy #1

Closed NoHeartPen closed 1 year ago

NoHeartPen commented 2 years ago

Sorry, I am a Chinese student majoring in Japanese Language and Literature, so my English is not very good. I hope you can forgive any grammatical errors in this post.

It looks like you left out some details:

For example, take the adjective (形容詞) form 高そうだ. With your approach, we should be able to find 高い by looking up 高そ, but judging from the .aff file, you seem to have missed this case. (Sorry, I read the Hunspell manual, but I still don't understand what the 0 does, so I assume this transformation is missing.)
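On the question about the 0: in the Hunspell affix file format, a 0 in the strip field or the affix field of a rule stands for the empty string (strip nothing, or add nothing). The Python sketch below mimics how one suffix rule is applied. The function name `apply_sfx` and the example rules are hypothetical and are not taken from this project's .aff file, and Hunspell's real condition syntax is more limited than the regular expressions used here.

```python
import re
from typing import Optional


def apply_sfx(word: str, strip: str, add: str, condition: str) -> Optional[str]:
    """Apply one Hunspell-style suffix rule to a dictionary word.

    A field value of "0" means the empty string: strip nothing or add nothing.
    Returns None when the rule does not apply to the word.
    """
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    # The condition is checked against the end of the word; "." matches anything.
    if condition != "." and not re.search(f"(?:{condition})$", word):
        return None
    if strip and not word.endswith(strip):
        return None
    base = word[: len(word) - len(strip)] if strip else word
    return base + add


# Hypothetical rules: strip the final kana and add nothing ("0").
print(apply_sfx("高い", "い", "0", "い"))      # 高
print(apply_sfx("食べる", "る", "0", "る"))    # 食べ
print(apply_sfx("書く", "る", "0", "る"))      # None (condition not met)
```

Under this reading, a rule whose affix field is 0 produces the bare stem (高, 食べ), which is the form the user is expected to select.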


Next, about 一段動詞: they all end in る. So for a phrase like 晩ご飯を食べた, we should look up 食べた (if we look up 食べ, we may get 食ぶ instead), but it looks like you are ignoring this too:


Finally, about colloquial speech (口語): in すべてのおやつは食べちゃった, maybe we should look up 食べちゃ instead of 食べち, because selecting text with the mouse may not be that precise.


There are also many other special expressions in Japanese, such as katakana spellings:


You can refer to this Hunspell file on GitHub by another Chinese college student:

https://github.com/MrCorn0-0/hunspell_ja_JP/

Although his approach to verbs is not the same as yours, it may help you.

The following are the common word endings in modern Japanese that I have collected ( is a 五段動詞, and the is a 上一段動詞; I am not familiar with 古文, so classical forms are not included):

The above data comes from the "NonJishoKei" project. You can find it on the FreeMdict Forum in the thread 《真·哪里不会点哪里_日本語非辞書形辞典_v2 - 日语 - FreeMdict Forum》. You just need to put it in the content folder like any other mdx file; when your clipboard contains 食べた, it gives a hyperlink that jumps to 食べる.
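The table-driven idea described here can be sketched roughly as follows in Python. This is only an illustration of the general technique, not NonJishoKei's actual data or code: the ending table, the headword list, and the function name `table_deinflect` are all hypothetical.

```python
# Hypothetical ending table: (inflected ending, dictionary-form ending).
# Longer endings come first so that 食べちゃった is tried against ちゃった before た.
ENDING_TABLE = [
    ("ちゃった", "る"),
    ("そうだ", "い"),
    ("いた", "く"),
    ("た", "る"),
]

# Hypothetical headword list standing in for a real dictionary.
LEXICON = {"食べる", "高い", "書く"}


def table_deinflect(text, lexicon=LEXICON):
    """Return the dictionary forms reachable from `text` via the ending table."""
    results = []
    for ending, replacement in ENDING_TABLE:
        if text.endswith(ending):
            candidate = text[: -len(ending)] + replacement
            # Keep only candidates that are real headwords.
            if candidate in lexicon and candidate not in results:
                results.append(candidate)
    return results


print(table_deinflect("食べた"))        # ['食べる']
print(table_deinflect("高そうだ"))      # ['高い']
print(table_deinflect("食べちゃった"))  # ['食べる']
print(table_deinflect("走った"))        # [] (no matching ending in the table)
```

The last call shows the cost of this design: every ending has to be listed explicitly, so any form missing from the table is not resolved.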

You can also learn more on GitHub:

https://github.com/NoHeartPen/NonJishoKei

It is a development project dedicated to providing the best clipboard word-lookup experience. Currently it consists of 日本語非辞書形辞典.mdx and two scripts written in Python and JavaScript. (Sorry, the software used by the two scripts was all developed by Chinese developers, so it may be inconvenient for foreigners to use. I will submit a PR to Saladict in the future. Saladict is an online dictionary add-on like Yomichan, but with it you can use Weblio and Weblio 英和和英辞典.)

Finally, it's really nice to meet someone who has the same idea as me. I hope you can achieve the best clipboard word-lookup experience on GoldenDict with Hunspell :)


epistularum commented 1 year ago

@NoHeartPen

Please excuse the very late reply; I must have somehow missed the GitHub notification. It seems that all the questions come from a misunderstanding of Japanese conjugation/inflection. (My native language is not English either, so bear with me.)

1. 高そうだ

This example is constructed from the stem (語幹) of the adjective combined with そうだ. Here is the relevant dictionary entry for this grammatical point: [image]

2. 一段動詞

All forms of this verb class share the same base: only the る is removed. Please refer to the table below. For a more concrete example:

書いた: 書い (連用形 of 書く, in its 音便 form) + た (inflectional suffix)
食べた: 食べ (連用形 of 食べる) + た (inflectional suffix)

The inflectional suffix (助動詞) is not part of the verb base; it is simply attached to it. [image] [image]

3. 食べちゃ

The construction here is once again just the verb base 食べ plus the inflectional suffix ちゃう. Here is the relevant dictionary entry: [image]

4. Alternate spelling

Alternate spellings of words, such as チョロチョロ instead of ちょろちょろ, are something I have thought about, but they are outside the scope of this project since they are not related to conjugation. Such a project would require a lot of dedication and time because it would need to cover all these issues:

As you must know, there is a huge number of inflectional suffixes, some of which are specific to a dialect or a time period. Cataloging all suffixes is an impossible task, as you would have to catalog not only all possible past and historical inflections but also all regional dialects. To get around this issue, this Hunspell dictionary is based on the verb stem, which is the common denominator across the Japanese language.
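A rough Python illustration of the stem-based idea is shown below. It is a simplified sketch under stated assumptions, not the project's actual Hunspell rules: the kana tables only cover the イ段 and エ段 endings, 音便 forms such as 書い are not handled, and the names `raw_candidates`, `deinflect_stem`, and `LEXICON` are hypothetical.

```python
# Final kana of a godan verb base, aligned column by column with the ウ段.
U_ROW = "うくぐすつぬぶむる"
I_ROW = "いきぎしちにびみり"
E_ROW = "えけげせてねべめれ"

# Map a final イ段 or エ段 kana back to the ウ段 kana of the dictionary form.
TO_U_ROW = {kana: u for row in (I_ROW, E_ROW) for kana, u in zip(row, U_ROW)}

# Hypothetical headword list standing in for a real dictionary.
LEXICON = {"食べる", "高い", "書く"}


def raw_candidates(stem):
    """Generate possible dictionary forms for a selected stem, unfiltered."""
    candidates = [stem + "る", stem + "い"]  # ichidan verb, い-adjective
    last = stem[-1:]
    if last in TO_U_ROW:  # godan verb: swap the final kana for its ウ段 kana
        candidates.append(stem[:-1] + TO_U_ROW[last])
    return candidates


def deinflect_stem(stem, lexicon=LEXICON):
    """Keep only the candidates that are real headwords."""
    return [c for c in raw_candidates(stem) if c in lexicon]


print(raw_candidates("食べ"))   # ['食べる', '食べい', '食ぶ']
print(deinflect_stem("食べ"))   # ['食べる']
print(deinflect_stem("高"))     # ['高い']
print(deinflect_stem("書き"))   # ['書く']
```

No suffix such as た, ちゃう, or そうだ appears anywhere in the code: whatever follows the stem is irrelevant, and spurious candidates like 食ぶ are discarded only because they are not headwords.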

The drawbacks of my method are twofold:

  1. It is less user-friendly. A better understanding of Japanese conjugation and verb forms is needed.
  2. It may result in a less precise dictionary lookup. For example, if you are faced with 高そう, you will need to look up 高, but this will return a lot of dictionary entries unrelated to the adjective 高い: [image]

Advantages:

  1. It can de-inflect any conjugation in an agnostic way, without needing each inflectional suffix to be hardcoded in a de-inflection table.
  2. It is less bloated.

I am aware of your project and I think it's very cool as well. Both methods have merits and demerits, so it comes down to user preference. I wish you the best of luck with your project!