Open binarybottle opened 1 year ago
@iandoug -- The person who wrote to me is pointing me to the Leipzig corpus for Russian (as mentioned here: http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/russian-letter-frequencies/). Do you know if it's good enough to rely on for optimizing a Russian key layout?
Hi Arno
I usually differentiate between lower and upper (because it affects Shift key usage). The crypto sites don't.
On the other hand, you also work unicase, so if they've run the numbers then it should be okay. :-)
For me to do it would require first cleaning up all the non-Russian text and other crud in their files. I used to do it the hard way by fixing each errant character, but that is tedious in the extreme. So lately I've adopted a more brute-force method: keep a list of valid characters, and any line containing a character NOT in that set is simply dropped ... the whole line. This may slightly skew the distribution of the valid characters but is vastly more practical.
The layout above seems to be missing some chars from US ANSI that are needed in programming. Also on the WP image.
Wikipedia shows big-ass enter variant, I would have thought they would use ISO at least for an extra key. https://en.wikipedia.org/wiki/Keyboard_layout#/media/File:KB_Russian.svg
WP says "Keyboards in Russia always have Cyrillic letters on the keytops as well as Latin letters. Usually Cyrillic and Latin letters are labeled with different colors. "
Wonder if you can access the English chars while using the Russian layout ?....
Let me take a look at how big the Russian corpus is ...
Cheers, Ian
I see WP also has a letter frequency chart (towards bottom)
https://en.wikipedia.org/wiki/Russian_alphabet
Can we get a canonical list of which chars must be on keyboard?
I would probably want to put the brackets, #, | etc back :-)
I've grabbed the Russian files from Leipzig ... they have other-Soviet-non-Russian sources which I am ignoring for now.
Cheers, Ian
Leipzig has some stats. Can't see letter freq
https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013
but here is punctuation: https://cls.corpora.uni-leipzig.de/en/rus_mixed_2013/2.1.4_Special%20Characters.html
Contains characters that I would typically ignore.
Thank you for taking a look, @iandoug!
I have no idea what the complete list of characters should contain...
Russian alphabet (WP order) АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя
Other characters on current Russian (Windows) keyboard: 1234567890!"№;%:?*₽()-_=+\/.,
Others on US ANSI: @#$^&[]{}|`~'<>
russian-dirty.txt: 1,697,253,979 bytes, 963,432,405 characters according to wc
russian-clean1.txt: 1,219,345,880 bytes, 670,203,855 characters according to wc
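As a quick sanity check on how much text the cleanup removed, the wc figures above can be turned into retention ratios. This is only arithmetic on the numbers quoted in this comment; the variable names are made up for illustration.

```python
# Figures copied from the wc output quoted above.
dirty_bytes, dirty_chars = 1_697_253_979, 963_432_405
clean_bytes, clean_chars = 1_219_345_880, 670_203_855

# Share of the original corpus that survived the cleanup.
char_retention = clean_chars / dirty_chars
byte_retention = clean_bytes / dirty_bytes

# Average UTF-8 bytes per character; Cyrillic letters take 2 bytes,
# while ASCII digits, punctuation and spaces take 1.
dirty_bytes_per_char = dirty_bytes / dirty_chars
clean_bytes_per_char = clean_bytes / clean_chars

print(f"characters kept: {char_retention:.1%}")
print(f"bytes kept:      {byte_retention:.1%}")
print(f"bytes/char before: {dirty_bytes_per_char:.2f}, after: {clean_bytes_per_char:.2f}")
```

Roughly 70% of the characters survive, and the slightly higher bytes-per-character figure after cleaning is consistent with the dropped lines having carried proportionally more single-byte (Latin) text.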
I used the following chars: АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя1234567890!№;%:?*₽()-_=+/.,«»" plus space, tab and enter.
Included the «» because they are quite common, and are in the layout above. Fixed assorted dashes and ellipses.
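The drop-the-whole-line filtering described earlier, combined with the character set listed above, could be sketched roughly as follows. This is not the script that was actually used: the function names are hypothetical, and the dash and ellipsis replacements are an assumption, since the thread only says they were "fixed" without saying how.

```python
# Valid characters as listed in the comment above, plus space and tab.
# Line endings ("enter") are handled by processing the text line by line.
VALID_CHARS = set(
    "АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя"
    "1234567890!№;%:?*₽()-_=+/.,«»\""
    " \t"
)

# ASSUMPTION: map en dash and em dash to a hyphen and the single-character
# ellipsis to three dots. The real replacements are not given in the thread.
NORMALIZE = str.maketrans({"\u2013": "-", "\u2014": "-", "\u2026": "..."})


def clean_lines(lines):
    """Yield normalized lines that contain only valid characters.

    Any line with at least one character outside VALID_CHARS is dropped
    whole; empty lines are dropped too.
    """
    for line in lines:
        text = line.rstrip("\r\n").translate(NORMALIZE)
        if text and all(ch in VALID_CHARS for ch in text):
            yield text


def clean_file(src_path, dst_path):
    """Stream src_path through clean_lines and write the result to dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for text in clean_lines(src):
            dst.write(text + "\n")
```

Something like `clean_file("russian-dirty.txt", "russian-clean1.txt")` would then process the corpus one line at a time, which matters at this file size because nothing is loaded into memory as a whole.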
Char freq and bigram counts attached. russianfreq1.txt rawfollow-russian1.csv
Busy generating 1MB chained bigrams but it's struggling a bit, so I don't know how good the final result will be.
Sample of ignored text attached.
Chained bigrams. russianmonkeytest-1MB.zip
Amazing -- Thank you for cleaning up the corpus! You know, I would write to those who maintain the original corpus so that they host your cleaned version and credit you accordingly.
It would be better to generate Щ by double-tapping Ш, and Ъ by double-tapping Ь. These two letters are used really rarely, unlike the Home and Page Up keys.
I think all of Engram's punctuation in the middle must be preserved. You can also combine letters on one button if needed: (е, ё), (и, й), (ь, ъ), (ш, щ), ... . I would prefer the Engram layout for syntax symbols and QWERTY on the free number keys (Engram over QWERTY). It would be some kind of hybrid. Also, Home and Page Up / Down must be on separate keys from the character/number layout. Right now I use an ErgoDone with Engram and Russian (QWERTY), so the Russian layout repeats QWERTY on this keyboard.
@iandoug -- Any thoughts on the above comments, or do you think your corpus and tables are ready?
What do the two columns of numbers represent in russianfreq1.txt?
I am having difficulty parsing the rawfollow-russian1.csv file. To create an engram-russian layout, I just need bigrams and bigram frequencies. Would you be able to create a table with these?
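For reference, a plain table of bigrams and their frequencies could be produced from the cleaned corpus along these lines. This is a minimal sketch, not the tool that generated rawfollow-russian1.csv; the function names are hypothetical, and it assumes unicase counting (everything lowercased) and that only letter-letter pairs within a word are wanted.

```python
from collections import Counter

# Lowercase Russian alphabet; text is lowercased before counting (unicase).
RUSSIAN_LETTERS = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def bigram_counts(lines, alphabet=RUSSIAN_LETTERS):
    """Count adjacent letter pairs, skipping any pair that includes a
    space, digit or punctuation mark (ASSUMPTION: within-word bigrams only)."""
    counts = Counter()
    for line in lines:
        text = line.lower()
        for a, b in zip(text, text[1:]):
            if a in alphabet and b in alphabet:
                counts[a + b] += 1
    return counts


def bigram_table(counts):
    """Return (bigram, count, relative frequency) rows, most frequent first."""
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(bigram, n, n / total) for bigram, n in rows]


def format_tsv(table):
    """Render the table as tab-separated text with a header row."""
    out = ["bigram\tcount\tfrequency"]
    out += [f"{bg}\t{n}\t{freq:.6f}" for bg, n, freq in table]
    return "\n".join(out)
```

Passing an open file handle for the cleaned corpus to `bigram_counts` keeps memory use small, since only the counter (at most 33 x 33 entries) is held in memory.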
No rush on responding to any of the above, as I have plenty to do, and I think https://github.com/binarybottle/engram/issues/58 is a more pressing challenge.
Those are probably wrong, will send revised versions in the week.
Thank you, but again no rush -- I just want to make sure I have the right data to work with when I get to this in the future.
A particularly annoying problem is the Ё. This letter has been basically an afterthought for quite a while, and I wonder if the corpus might be contaminated by people substituting it with Е for basically a century at this point.
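One rough way to gauge this would be to measure how often ё actually appears relative to е in the cleaned corpus. The sketch below only reports the observed share; it cannot tell which е should have been ё (that would need a dictionary), and the function name is made up for illustration.

```python
from collections import Counter


def yo_share(lines):
    """Return the fraction of е/ё occurrences that are written as ё.

    Counting is case-insensitive. Returns 0.0 if neither letter occurs.
    """
    counts = Counter()
    for line in lines:
        for ch in line.lower():
            if ch in "её":
                counts[ch] += 1
    total = counts["е"] + counts["ё"]
    return counts["ё"] / total if total else 0.0
```

A very low share compared with a source known to spell ё consistently would suggest that the substitution is widespread in the corpus, in which case the ё count probably understates how often the letter is really needed.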
I'm also quite curious how to fit the "extra" letters that the Russian alphabet requires on, say, an ortholinear split keyboard. Even the base English version requires some care.
Probably need another issue opened for this, but it's the problem that made me look at this page in the first place. I've been trying to work out how to even fit Engram on my Sofle (mirrored split, each side having a 4x6 main key area and 5 keys below that, of which 2 are thumb keys), and I've realized that it would require shuffling quite a bit around. That would have been fine if I didn't also need to keep ЙЦУКЕН (JCUKEN) around, which is mapped with a QWERTY keyboard in mind. I suppose on a home desktop I can do whatever it takes to make things work okay, but it still makes for an awkward setup.
I received an email last week: