binarybottle / engram

Arno's Engram v2.0 ("Engram") layout is an optimized key layout for touch typing in English based on ergonomic considerations, with a protocol and software for creating new, optimized key layouts in other languages.
MIT License

How to create Engram for Russian? #49

Open binarybottle opened 1 year ago

binarybottle commented 1 year ago

I received an email last week:

[Russian has] 33 letters. So I think the only option is to use Home and Page Up for ъ and ь (these keys are used for typing in the Japanese layout for Kinesis keyboards). I made a preliminary version of the layout (for an ergonomic ortholinear keyboard like the new Kinesis Advantage360):

[ № = − + / ’ ° § % * ]
Ф 1 2 3 4 5 6 7 8 9 0 Э

    БУКЛ  –(  —) ДГЗХЦ
    ЫВАЕ  ,;    .:   НОСТЁ
    ЙЧИЯ  -_  ?!   РМПЖ
    Ш                         ЩЮ
              Ъ@  Ь#
               «„     »“

I made some changes to the typographical symbols scheme according to Russian typographic tradition. Letters are arranged by frequency, including bigrams and trigrams. But I am still not sure whether it is good or not.

binarybottle commented 1 year ago

@iandoug -- The person who wrote to me is pointing me to the Leipzig corpus for Russian (as mentioned here: http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/russian-letter-frequencies/). Do you know if it's good enough to rely on for optimizing a Russian key layout?

iandoug commented 1 year ago

Hi Arno

I usually differentiate between lower and upper (because it affects Shift key usage). The crypto sites don't.

On the other hand, you also work unicase, so if they've run the numbers then it should be okay. :-)

For me to do it would require first cleaning up all the non-Russian text and other crud in their files. I used to do it the hard way by fixing each errant character, but that is tedious in the extreme. So lately I have adopted a more brute-force method: keep a list of valid characters, and drop any line containing a character NOT in that set ... the whole line. This may slightly skew the distribution of the valid characters but is vastly more practical.
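A minimal sketch of that kind of whole-line filter, in Python. The function name, the `VALID` set, and the sample lines are all hypothetical and for illustration only; the real whitelist would be the full character list for the target layout.

```python
def keep_valid_lines(lines, valid_chars):
    """Yield only the lines whose characters are all in valid_chars."""
    valid = set(valid_chars)
    for line in lines:
        # Ignore the trailing newline when checking, but keep it in the output.
        if set(line.rstrip("\n")) <= valid:
            yield line


# Assumed whitelist: А-я (U+0410..U+044F) plus Ё/ё, digits, space, tab,
# and a few punctuation marks. This is an illustrative subset only.
LETTERS = "".join(chr(c) for c in range(0x0410, 0x0450)) + "Ёё"
VALID = set(LETTERS + "0123456789" + " \t" + ".,!?-")

sample = ["Привет, мир!\n", "Hello, мир\n", "Ёжик 42\n"]
cleaned = list(keep_valid_lines(sample, VALID))
print(cleaned)
```

The second sample line is dropped entirely because it contains Latin letters, which is the trade-off described above: simple and fast, at the cost of discarding some valid text.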

The layout above seems to be missing some chars from US ANSI that are needed in programming. Also on the WP image.

Wikipedia shows big-ass enter variant, I would have thought they would use ISO at least for an extra key. https://en.wikipedia.org/wiki/Keyboard_layout#/media/File:KB_Russian.svg

WP says "Keyboards in Russia always have Cyrillic letters on the keytops as well as Latin letters. Usually Cyrillic and Latin letters are labeled with different colors."

Wonder if you can access the English chars while using the Russian layout ?....

Let me take a look at how big the Russian corpus is ...

Cheers, Ian

iandoug commented 1 year ago

I see WP also has a letter frequency chart (towards bottom)

https://en.wikipedia.org/wiki/Russian_alphabet

Can we get a canonical list of which chars must be on keyboard?

I would probably want to put the brackets, #, | etc back :-)

I've grabbed the Russian files from Leipzig ... they have other-Soviet-non-Russian sources which I am ignoring for now.

Cheers, Ian

iandoug commented 1 year ago

Leipzig has some stats. Can't see letter freq

https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013

but here is punctuation: https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013/2.1.4_Special%20Characters.html

Contains characters that I would typically ignore.

binarybottle commented 1 year ago

Thank you for taking a look, @iandoug!

I have no idea what the complete list of characters should contain...

iandoug commented 1 year ago

Russian alphabet (WP order) АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя

Other characters on current Russian (Windows) keyboard: 1234567890!"№;%:?*₽()-_=+\/.,

Others on US ANSI: @#$^&[]{}|`~'<>

iandoug commented 1 year ago

Chinese Apple style.

https://ae01.alicdn.com/kf/Se61863d874b54fe2b7b62167caa158140/Keychron-Russian-ANSI-Layout-OEM-Dye-Sub-PBT-Keycap-Set-for-Q1-Q2-K2-Mechanical-Keyboard.png

https://github.com/DandelionSprout/Russian-Extended

iandoug commented 1 year ago

russian-dirty.txt: 1,697,253,979 bytes, 963,432,405 characters according to wc
russian-clean1.txt: 1,219,345,880 bytes, 670,203,855 characters according to wc

I used the following chars: АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя1234567890!№;%:?*₽()-_=+/.,«»" plus space, tab and enter.

Included the «» because they are quite common, and are in the layout above. Fixed assorted dashes and ellipses.

Char freq and bigram counts attached. russianfreq1.txt rawfollow-russian1.csv
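For readers who want to reproduce character counts of this kind from their own cleaned text, here is a rough Python sketch. It is case-sensitive, since upper and lower case affect Shift usage, and it skips whitespace. The function name and output shape are assumptions and are not meant to match the format of the attached files.

```python
from collections import Counter


def char_frequencies(text):
    """Return (char, count, share) tuples, most frequent first.

    Case-sensitive; whitespace characters are skipped.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [(ch, n, n / total) for ch, n in counts.most_common()]


result = char_frequencies("аба Б")
for ch, n, share in result:
    print(ch, n, f"{share:.2%}")
```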

Busy generating 1MB of chained bigrams, but it's struggling a bit, so I don't know how good the final result will be.

Sample of ignored text attached. ignored

iandoug commented 1 year ago

generating

iandoug commented 1 year ago

Chained bigrams. russianmonkeytest-1MB.zip

binarybottle commented 1 year ago

Amazing -- Thank you for cleaning up the corpus! You know, I would write to those who maintain the original corpus so that they host your cleaned version and credit you accordingly.

7orlum commented 1 year ago

It would be better to generate Щ on a double tap of Ш, and Ъ on a double tap of Ь. These two letters are used very rarely, unlike the Home and Page Up keys.
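To make the double-tap idea concrete, here is a small Python simulation. It is only a sketch: the `DOUBLE_TAP` mapping, the function name, and the 0.3 second window are assumptions for illustration, and real keyboard firmware would implement this differently.

```python
# Hypothetical mapping: a quick second tap turns the first letter into the second.
DOUBLE_TAP = {"ш": "щ", "ь": "ъ"}


def resolve_double_taps(events, window=0.3):
    """Turn (time_in_seconds, char) key events into the typed string.

    A second tap of a mapped key within `window` seconds replaces the
    previously typed letter with its double-tap counterpart.
    """
    out = []
    last_time = None
    for t, ch in events:
        is_double = (
            out
            and ch in DOUBLE_TAP
            and out[-1] == ch
            and last_time is not None
            and t - last_time <= window
        )
        if is_double:
            out[-1] = DOUBLE_TAP[ch]
        else:
            out.append(ch)
        last_time = t
    return "".join(out)


print(resolve_double_taps([(0.0, "ш"), (0.1, "ш")]))  # quick double tap
print(resolve_double_taps([(0.0, "ш"), (1.0, "ш")]))  # two slow taps
```

One design consequence worth noting: any word that genuinely needs the same letter twice in a row would have to be typed with a pause longer than the window.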

asim215 commented 1 year ago

I think all of Engram's punctuation in the middle must be preserved. You can also combine letters on one key if needed: (е, ё), (и, й), (ь, ъ), (ш, щ), ... I would prefer the Engram layout for syntax symbols and QWERTY for the free number keys (Engram over QWERTY). It would be some kind of hybrid. Also, Home and Page Up / Down must be on separate keys from the character/number layout. Right now I use an ErgoDone with Engram and Russian (QWERTY), so the Russian layout follows the QWERTY positions on this keyboard.

binarybottle commented 1 year ago

@iandoug -- Any thoughts on the above comments, or do you think your corpus and tables are ready?

What do the two columns of numbers represent in russianfreq1.txt?

I am having difficulty parsing the rawfollow-russian1.csv file. To create an engram-russian layout, I just need bigrams and bigram frequencies. Would you be able to create a table with these?
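For reference, a plain two-column table of bigrams and counts can be derived directly from cleaned text. The Python sketch below is unicase (everything is lowercased) and only counts letter pairs inside words, so any non-letter breaks the chain. The function name and the `ALPHABET` constant are hypothetical, and this does not attempt to parse rawfollow-russian1.csv.

```python
from collections import Counter

# Lowercase а-я (U+0430..U+044F) plus ё, which sits outside that range.
ALPHABET = "".join(chr(c) for c in range(0x0430, 0x0450)) + "ё"


def bigram_table(text, alphabet=ALPHABET):
    """Return (bigram, count) pairs, most frequent first (unicase)."""
    letters = set(alphabet)
    counts = Counter()
    prev = None
    for ch in text.lower():
        if ch in letters:
            if prev is not None:
                counts[prev + ch] += 1
            prev = ch
        else:
            prev = None  # spaces and punctuation end the current word
    return counts.most_common()


table = bigram_table("Ёж, ёжик")
for bigram, n in table:
    print(f"{bigram}\t{n}")
```

Dividing each count by the total gives relative frequencies if those are preferred over raw counts.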

No rush on responding to any of the above, as I have plenty to do, and I think https://github.com/binarybottle/engram/issues/58 is a more pressing challenge.

iandoug commented 1 year ago

Those are probably wrong, will send revised versions in the week.

binarybottle commented 1 year ago

Thank you, but again no rush -- I just want to make sure I have the right data to work with when I get to this in the future.

SaphireLattice commented 7 months ago

A particularly annoying problem is the Ё. This letter has been basically an afterthought for quite a while, and I wonder if the corpus might be contaminated by people substituting it with Е for basically a century at this point.
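One rough way to gauge this is to measure how often ё appears relative to е in the corpus. The Python sketch below uses a hypothetical helper name; a very low share would be consistent with widespread substitution, though it cannot prove it without a reference text where ё is written consistently.

```python
def yo_share(text):
    """Share of ё among all ё and е occurrences (case-insensitive)."""
    lowered = text.lower()
    yo = lowered.count("ё")
    ye = lowered.count("е")
    return yo / (yo + ye) if (yo + ye) else 0.0


print(yo_share("Ёлка и ель"))
```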

I'm also quite curious how to fit the "extra" letters that the Russian alphabet requires on, say, an ortholinear split keyboard. Even the base English version requires some care.

This probably needs another issue opened, but it's the problem that made me look at this page in the first place. I've been trying to work out how to fit Engram on my Sofle (mirrored split, each side being a 4x6 main key area with 5 keys below that, of which 2 are thumb keys), and I've realized that it would require shuffling quite a bit around. That would have been fine if I didn't also need to keep ЙЦУКЕН (JCUKEN) around, which is mapped with a QWERTY keyboard in mind. I suppose on my home desktop I can do whatever it takes to make things work okay, but it still makes for an awkward setup.