getreuer / qmk-keymap

My keymap & reusable QMK gems
Apache License 2.0

Larger data set #1

Closed drashna closed 2 years ago

drashna commented 2 years ago

By chance, do you have a larger data set for the auto-correction functionality?

filterpaper commented 2 years ago

@drashna There's a larger list here that I'm working on integrating. The code has to be updated for apostrophe input: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

drashna commented 2 years ago

There are too many entries in that list to really apply, e.g. https://gist.github.com/drashna/46c64ea29a382f754b2cf957ebd8e924

I'm running the autocorrect code on a Blackpill-based controller, which has a lot more system memory and program memory than the ATmega32U4, and a much higher clock speed.

Would it be possible to expand the number of possible entries?

getreuer commented 2 years ago

@drashna thanks for the awesome idea! A larger dictionary would be useful. And thank you @filterpaper for that helpful Wiki link, plus your other help already on email for improving the autocorrection feature.

I combed over the list and some other sources to make a larger 400-entry dictionary: https://github.com/getreuer/qmk-keymap/blob/main/features/autocorrection_dict_extra.txt. It builds to a table of about 6000 bytes, so it's big but plausible depending on the board. I tested it and it works on the Moonlander.

Going beyond that, I tried building your ~3000 entry dict and see that it fails with AssertionError on assert 0 <= byte_offset <= 0x7fff. This assertion comes up because the implementation uses 15-bit offsets to link between trie nodes (using a uint16 where the high bit has a separate purpose). There is a description here if anyone is interested in the details. So the implementation currently supports a table up to 32 KB, or on the order of 2000 entries, which is not large enough for your dict.
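To make the limit concrete, here is a minimal sketch of packing a 15-bit offset and a one-bit flag into a uint16. The names `pack_link`, `unpack_link`, `OFFSET_MASK` and `FLAG_BIT` are hypothetical, and what the high bit actually signals is not stated in this thread, so this is an illustration of the general technique rather than the real generator code.

```python
# Hypothetical sketch: names and the meaning of the flag bit are assumptions,
# not taken from the actual table generator.
OFFSET_MASK = 0x7FFF  # low 15 bits hold the byte offset of the linked node
FLAG_BIT = 0x8000     # high bit is reserved for a separate purpose


def pack_link(byte_offset, flag=False):
    """Pack a byte offset and a one-bit flag into a single uint16 value."""
    # Same bound as the assertion quoted above: the offset must fit in 15 bits.
    assert 0 <= byte_offset <= OFFSET_MASK
    return byte_offset | (FLAG_BIT if flag else 0)


def unpack_link(link):
    """Split a uint16 link back into (byte_offset, flag)."""
    return link & OFFSET_MASK, bool(link & FLAG_BIT)


# Largest table that every link can still address: 2**15 bytes = 32 KB.
MAX_TABLE_BYTES = OFFSET_MASK + 1

if __name__ == "__main__":
    print(hex(pack_link(5, flag=True)))
    print(unpack_link(pack_link(0x7FFF)))
    print(MAX_TABLE_BYTES)
```

Any node located past byte 0x7fff cannot be linked to under this scheme, which is why a table larger than 32 KB trips the assertion.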

There are many links in the trie, so I'd rather not simply increase the offsets' bit width if it can be avoided. I have a couple of ideas. I'll think this over and report back when I have something.

drashna commented 2 years ago

Yeah, @filterpaper and I have both been messing with the code.

And yeah, ran into the assertion error, and figured that was the case. But I'm not too sure what all is going on in this code, so fixing/improving it is over my head.

getreuer commented 2 years ago

@drashna in commit 99a2b160963129f9efb6062ee6ab41b1115308e7, I worked out a revision to increase the max supported table size from 32 KB to 64 KB. I successfully built your ~3000 entry dict:

Processed 2982 autocorrection entries to table with 48293 bytes. 

It flashed and worked successfully on my Moonlander. It's cool to be running autocorrect with that many entries.
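As a rough sanity check on these sizes, the sketch below compares the 48293-byte table against the old and new limits and extrapolates a capacity estimate from the figures quoted in this thread. It assumes the limits are exactly 32 × 1024 and 64 × 1024 bytes and that table size grows roughly linearly with entry count. A trie shares nodes between entries, so the linear estimate is only a ballpark. The helper names are hypothetical.

```python
KB = 1024
OLD_LIMIT = 32 * KB  # assumed exact value of the "32 KB" limit
NEW_LIMIT = 64 * KB  # assumed exact value of the "64 KB" limit


def fits(table_bytes, limit_bytes):
    """True if a table of the given size stays within the limit."""
    return table_bytes <= limit_bytes


def estimated_capacity(limit_bytes, sample_entries, sample_bytes):
    """Ballpark entry count for a limit, extrapolated linearly from one build."""
    return int(limit_bytes * sample_entries / sample_bytes)


if __name__ == "__main__":
    # Figures quoted in this thread: 2982 entries built to 48293 bytes.
    print(fits(48293, OLD_LIMIT))
    print(fits(48293, NEW_LIMIT))
    print(estimated_capacity(OLD_LIMIT, 2982, 48293))
    print(estimated_capacity(NEW_LIMIT, 2982, 48293))
```

Under those assumptions the old limit works out to roughly 2000 entries, consistent with the earlier estimate, and the ~3000 entry dict fits only under the new limit.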

Side note: There are "false trigger" warnings on that dict. Install the "english_words" Python package to enable this checking (pip install english_words). Here is a sampling:

Warning:1483: Typo "hten" would falsely trigger on correctly spelled word "heighten".
Warning:1487: Typo "hting" would falsely trigger on correctly spelled word "nightingale".
Warning:1666: Typo "interm" would falsely trigger on correctly spelled word "intermediary".
Warning:1786: Typo "lsat" would falsely trigger on correctly spelled word "pulsate".
...

For practical use, I'd add word breaks (e.g. :hten:) before and/or after those entries to avoid these false triggers.
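The effect of the word breaks can be sketched with a plain substring test. This is a simplified illustration, not the actual checking code: the `false_triggers` helper is hypothetical, and the tiny `WORDS` list stands in for a full English word list.

```python
# Tiny stand-in word list (an assumption for this sketch); a real check would
# use a full English dictionary.
WORDS = ["heighten", "nightingale", "intermediary", "pulsate"]


def false_triggers(typo, words=WORDS):
    """Return the correctly spelled words on which `typo` would fire.

    A ':' at the start or end of `typo` stands for a word break. Each word is
    wrapped in ':' so one substring test handles both letters and breaks.
    """
    return [word for word in words if typo in f":{word}:"]


if __name__ == "__main__":
    print(false_triggers("hten"))      # matches inside "heighten"
    print(false_triggers(":hten:"))    # breaks on both sides: no match
    print(false_triggers(":interm"))   # leading break alone still matches
    print(false_triggers(":interm:"))  # trailing break is needed here
```

Note that for "interm" a leading break alone does not help, since "intermediary" starts with it. That entry needs the trailing break as well.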

drashna commented 2 years ago

Yup, I saw. I was looking at the issue and noticed it was closed.

And thank you! And yeah, it's nice having the option for a much larger data set on boards that can support it!

Also, filterpaper got it working with PROGMEM. That doesn't matter on boards like the Moonlander, but for AVR boards it can be important.

drashna commented 2 years ago

Tested the changes, and it doesn't work for me. I suspect that there are some changes to the autocorrect.c file that didn't get pushed?

getreuer commented 2 years ago

D'oh, you are exactly right. I just added the forgotten autocorrect.c update in commit da5aacdb05b35c77656062ca8f2c2add8f96047b.

drashna commented 2 years ago

Well, I can't say that I haven't done the same. 😆 And yeah, with those changes, everything is working again!

filterpaper commented 2 years ago

@getreuer Thank you for the updates!