Open AKmatiAK opened 2 years ago
I updated my layout, here it is:
altgr layer remains the same
Please take a look at the optimized layout for Spanish (https://github.com/binarybottle/engram-es) that we created.
If there is very accurate and representative 1-gram and bigram frequency data for the Polish language (including symbols), then we could apply a modified version of the code to generate an Engram layout optimized for Polish.
Hi. I contacted the Polish corpus creators and got data up to 5-grams. It's available here: n-grams pl data. Do I need to process it further, or is it enough?
Also, here is a list of available resources that might be useful: link
This corpus looks pretty official! I like that it has a broad variety of book and news sources. Too bad it doesn't include spoken transcripts or social media sources. Anyway, I would be happy to help with this but it will be a couple of months before I can get to it -- buried with projects right now.
Ok, thank u for help ;)
@iandoug -- Given your experience helping to clean up the Spanish corpus, do you have any concerns about the proposed Polish corpus?: http://zil.ipipan.waw.pl/NKJPNGrams
Hi Arno
For keyboard layout use, I prefer to strip texts not normally typed on a computer keyboard (like spoken transcripts or tweets) because they will mess up the character frequencies and n-grams.
"Each unigram is maximum continuous chunk of non-whitespace lower-case characters."
That is the normal way of doing it. Ian of course is not normal and does it Case Sensitive ... :-) Because typing Th is different to th.
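The difference between the quoted lower-case definition and case-sensitive counting can be shown with a small sketch. The function name `unigrams` and the sample string are made up for illustration; this is not the corpus authors' tooling.

```python
from collections import Counter

def unigrams(text: str, fold_case: bool = True) -> Counter:
    """Count maximal runs of non-whitespace characters.

    With fold_case=True the text is lower-cased first (the quoted definition);
    with fold_case=False, "To" and "to" are counted separately.
    """
    if fold_case:
        text = text.lower()
    # str.split() with no arguments splits on any run of whitespace.
    return Counter(text.split())

sample = "To chyba dobra formuła. to chyba"
folded = unigrams(sample)
cased = unigrams(sample, fold_case=False)
```

Note that punctuation stays attached to the token (`formuła.`), since the definition only splits on whitespace.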
It looks like they only have "1-million-word subcorpus" available to download.
Is this typical Polish text?
Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.
Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
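Stripping leading dashes could look roughly like this. It is a minimal sketch: the function name is hypothetical, and the set of dash characters (hyphen-minus, en dash, em dash) is an assumption about what the corpus actually contains.

```python
import re

# A dash at the start of a line, with optional surrounding whitespace.
# Assumed dash characters: hyphen-minus, en dash (U+2013), em dash (U+2014).
LEADING_DASH = re.compile(r"^\s*[-\u2013\u2014]\s*")

def strip_dialogue_dash(line: str) -> str:
    """Remove one leading dialogue dash; dashes inside the line are kept."""
    return LEADING_DASH.sub("", line, count=1)
```

A line that legitimately starts with a minus sign (a negative number, say) would also lose it, so this is only reasonable for prose.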
I have not seen any HTML etc, but since this is "manually annotated", I guess it is "clean" in that regard. There are some ALL CAPS sentences.
Will see if I can extract the text and do some analysis over the weekend. We are currently having rolling blackouts so that messes up plans.
For future reference: Leipzig
https://wortschatz.uni-leipzig.de/en/download/Polish#pol_newscrawl_2011
What's the difference in quotes? Should both be on keyboard?
która znalazła się w zestawieniu "Billboard Magazine".
” 2012, a zespół otrzymał nominację do nagród „Songlines Music Awards” 2012 w kategorii „Best Group”. ” (2012) oraz realizator dźwięku przy filmie „
> Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
Yes, they're for dialogue, but they're rarely used outside books.
> What's the difference in quotes? Should both be on keyboard?
Bottom quotes are rarely used and shouldn't be on the keyboard (they are now superseded by upper quotes on both sides, and the meaning is the same).
> Is this typical Polish text?
> Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.
2 and 3 are typical; 1 is correct but more like something from a book.
Thank you for taking a look, @iandoug! I don't know Polish, so I will defer to @AKmatiAK and other Polish speakers/typists. A corpus of only 1 million words is pretty small, but I hope it represents what people type.
I took a look at the linked corpus, not wild about it, seems to contain a lot of dialogue. Will try cleaning up some of it as next step after this.
Instead, I grabbed all the 1M files from the Leipzig Polish corpus. After looking at those, decided to only use the "news" files, the rest is going to be a mess to clean. So that supplies 9 million sentences.
After tweaking my Spanish cleanup program, now have a 688 MB text file to play with. I grabbed some Polish books from Gutenberg ... only a few, most seem to be poetry or dialogue-heavy novels. Will try my usual "extract some text" approach with those to add to the Leipzig file.
Current char distribution looks like this. Provisional list, may change ...
Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)
112 characters?
@iandoug -- Using news files sounds reasonable, but I wouldn't throw out dialogues -- they are far closer to how people type emails than books are.
I took a further look at the NKJP n-grams and they're heavily bloated with transcripts of parliament sessions or something like that, so they're pretty useless. News/internet is the way to go. I'll take a look at the Leipzig files.
Sample from "Web" corpus attached.
Will do your "single-case" frequencies and bigrams in due course.
The dialogues are all like this:
% short sentence 1. % short sentence 2. % short sentence 3.
where % is the - character. Markdown getting in the way again.
idk how to read n-grams from leipzig. Is there any instruction for this?
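One way to get at the text is to read the sentences file directly. This sketch assumes the Leipzig download contains a sentences file with one entry per line in the form `ID<TAB>sentence`; that layout and the file name in the comment are assumptions to check against the actual download, and `leipzig_sentences` is a hypothetical helper name.

```python
from typing import Iterable, Iterator

def leipzig_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sentence text from lines assumed to look like 'ID<TAB>sentence'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split off the leading ID; keep the whole line if there is no tab.
        _, sep, sentence = line.partition("\t")
        yield sentence if sep else line

# Usage with a real file (hypothetical file name):
# with open("pol_newscrawl_2011_1M-sentences.txt", encoding="utf-8") as f:
#     for sentence in leipzig_sentences(f):
#         ...

demo = list(leipzig_sentences(["1\tTo chyba dobra formuła.\n", "\n", "2\tTak jest.\n"]))
```

The resulting sentences can then be fed into whatever character or n-gram counting is needed.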
> Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)
> 112 characters?
We use ISO keyboard, same as in US here. Both ISO and ANSI. Polish characters on altgr. 112 characters without space and enter
First attempt at bigrams. Am playing with trial layout, I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.
Also UDHR in Polish as temporary test file. The UN no longer seems to have .txt downloads, just PDF or on a web page.
udhr-polish.txt bigrams-polish1.csv
csv is tab-separated.
Most common: ie ni na ow st ze cz rz po ch an ra pr wi zy ro ia za wa ta dz sz od ki en ko ar ej mi li ci zi ac
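A list like this can be produced by counting adjacent letter pairs and writing them out tab-separated. The sketch below is not the program used in the thread; the function names are hypothetical, and it assumes case is folded and that only letter-letter pairs count.

```python
import csv
import io
from collections import Counter

def count_bigrams(text: str) -> Counter:
    """Count adjacent pairs of letters, case-folded; pairs touching non-letters are skipped."""
    text = text.lower()
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha():
            counts[a + b] += 1
    return counts

def write_tsv(counts: Counter, out) -> None:
    """Write 'bigram<TAB>count' rows, most frequent first."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for bigram, n in counts.most_common():
        writer.writerow([bigram, n])

counts = count_bigrams("Nie, nie!")
buf = io.StringIO()
write_tsv(counts, buf)
```

Because pairs are only counted when both characters are letters, nothing is counted across spaces or punctuation.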
> I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.
Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout? I checked which layouts keyboards currently sold in Poland have, and no standard is followed: both ISO and ANSI are used (not as I wrote previously; I thought that ISO was the standard here).
Yeah, I was surprised at what Wikipedia said about that. At the moment I have enough keys on ANSI, though must put Euro somewhere. Spanish and French more tricky because multiple diacritics per vowel. You only have 2 on Z.
Here's first attempt at chained bigrams, since the UDHR character frequency is not very good. But not happy with this file either, has too many digits. Which is a consequence of the "news" input I suppose.
Okay finally got somewhere but it "feels" a mess, probably because I know nothing about Polish. But it will give you something to compare against.
Ignore the layouts with .en. in the name, they are missing the Polish letters so their scores are wrong.
The bottom one is the "Programmer" layout which WP says is the most common. It might make sense to put some of these letters on their own keys, instead of Q V X since these 3 are not native Polish and thus rare. Or at least ł.
ł ord 197 hex c5 8602003
ż ord 197 hex c5 4490722
ó ord 195 hex c3 4461711
ś ord 197 hex c5 2988511
ć ord 196 hex c4 2271532
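The "ord" and "hex" values listed for these letters match the first byte of each letter's UTF-8 encoding rather than its Unicode code point (ł, for instance, is U+0142, encoded as the bytes c5 82). A quick check, with `lead_byte` as a hypothetical helper name:

```python
def lead_byte(ch: str) -> int:
    """Return the first byte of a character's UTF-8 encoding."""
    return ch.encode("utf-8")[0]

for ch in "łżóść":
    encoded = ch.encode("utf-8")
    # Print the code point, the full UTF-8 byte sequence, and the first byte.
    print(ch, f"U+{ord(ch):04X}", encoded.hex(" "), lead_byte(ch))
```

So several different letters share the same "ord" value in that list simply because their encodings start with the same byte.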
Enough for today. Getting better.
Been playing around. Current best version, changes may be too dramatic for easy acceptance.
Can compare performance against default Programmer version at bottom of list. Ignore .en. layouts.
Think I need the diacritic S letters on separate key, which means switching to ISO form factor.
Hand balance is 58:42, but can't find spot on right for popular letters on left ...
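A hand-balance figure like this can be approximated from character counts and a left/right key assignment. This is only a sketch: the function name and the toy counts are made up for illustration, and KLA's own metric may weigh things differently (thumbs and modifiers, for example).

```python
def hand_balance(left_keys: str, freq: dict[str, float]) -> tuple[float, float]:
    """Return (left %, right %) of typed characters.

    Every character not listed in left_keys is treated as right-hand.
    """
    total = sum(freq.values())
    left = sum(n for ch, n in freq.items() if ch in left_keys)
    return 100 * left / total, 100 * (total - left) / total

# Toy counts, invented for the example (not corpus data).
balance = hand_balance("ab", {"a": 58, "c": 42})
```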
This is my current layout, which I have been creating for about a month by simple intuition, applying fixes based on what I thought should be changed, so it might be useful to some extent in designing engram-pl. I know it lacks some keys, because I changed it frequently. In my subjective opinion, the cie trigram is very frequent and should be placed on the keyboard (but I may be wrong). Also, mixing different letters on one key is not a very good idea IMO; it might be faster but is unintuitive. Only ź should be placed on another letter. Placing ł on i instead of L is also reasonable, because I found it easy to remember somehow.
BTW: what I like a lot about ISO is the far better thumb access to AltGr and one more letter on the home row. I couldn't achieve that on ANSI, and because of that I stuck with my old ISO one.
of course I have caps and ctrl swapped ;)
Mmm so of course you would use a form factor that is not in KLA ..... neither ANSI nor ISO :-)
sz is a common bigram so should not be on same finger.
Q V X are not in your alphabet so it makes no sense to waste whole keys on them. They are only there because of QWERTY.
I made an ISO version, realised I had the spacebar on the wrong thumb, so had to basically mirror the layout to fix it.
Hand balance is nearly perfect now. ANSI version slightly better, but ISO puts the space bars further away and there's nothing I can do about that. Other metrics are better.
I may have used the wrong input file to create the chained bigrams, so redid it.
The Q X V can be put in better places ... first get the Polish to work :-)
@iandoug -- Thank you for hitting this hard over the weekend! I am slammed this week but hope to take a look at what you're doing next weekend.
Was not intending to but once you start fiddling with layouts ... like a drug :-)
Also have other stuff to do this week, will try to improve corpus when I have time.
> sz is a common bigram so should not be on same finger.
my right fingers position is ATZS + altgr edit: left hand JCIE + space. this way I use pinky only for ctrl (on caps) and my hand position is more straight. and right pinky for enter
Also, why are ó and ł in such strange positions? Are they typing-distance optimized too?
Is this correct? I added the pipe character "|" back.
> Also, why are ó and ł in such strange positions? Are they typing-distance optimized too?
Your accented characters seem to be almost treated as separate letters rather than "stressed" versions of the version without the diacritic. Certainly judging by the frequency of some of them.
Those letters are where they need to be so that the layout scores well. Your layout scores better than the default, but could do a lot better. Will send screenshot if above layout is correct.
I joined the few books I had together and cleaned up the unwanted characters. The file is 1,116,925 bytes.
The character frequency came out as iaeoznsrcwymtdkpł,ujl.bęgąhżśó-ć!ńPWAO;NfTZD?:"IźSRKC_JMBG*L10EU[]Ł824'F5HŻ736v)Ś9VY(xXq=/ÓQĘĄŃŹĆ{~&^`
while the frequency for the Leipzig "news" files is
aieonzrwstyckdpmujlł.bg,ęhążóś⮠PćfWS-0"KńMN1TA2ZBDORJCIG:LE53U4)(ź9F?687H!VvŚŁ/ŻxX'Y%;q+Q&ĘŹ@`ĆÓĄ*>~][$Ń<_=€#|^}{
The book's order for the most frequent characters is different, probably a consequence of using the main character's name a lot. I normally just take short extracts of books to avoid this, but don't have enough to do that (besides having somehow lost the program I wrote to do that).
So don't think I will include these texts. Will see what I can get out of the "official" corpus posted above.
The problem with that corpus is that it is intended for "parts of speech" analysis, not "what do people type on keyboards" like we need.
> Is this correct? I added the pipe character "|" back.
> Will send screenshot if above layout is correct.
The only difference is the minus sign on a different key, in the place where the middle dot was, but this doesn't matter. My layout also lacks greater/equal, but for the score this shouldn't make a difference.
Ah, I forgot: there isn't a "!" near N.
You can put greater/equal on the slashes; the only reason I didn't do this is that I want to make these keys modifiers in the future and didn't want to memorize those characters there.
Here's another letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072
> Here's another letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072
That's probably both cases merged ... the order of the most frequent is same as mine up to around c / y. I will make unicase list, also bigrams, for Arno. Thanks.... gives me confidence in my corpus.
> ah I forgot, there isn't "!" near N
There were two !, I removed the one on the letter keys and left the one on 1 as "standard". I had trouble with - and _ , the font on your keyboard is not so clear. Where should minus be? I did not see middle dot ...
I need to write a checker program to check layouts for all needed characters and no duplicates. Will add < and > to yours.
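Such a checker could be as small as the sketch below. It is not the program from the thread; `check_layout` and the toy inputs are made up, and it assumes a layout can be flattened to one string of every character it produces.

```python
from collections import Counter

def check_layout(layout_chars: str, required: str) -> tuple[set[str], set[str]]:
    """Return (missing, duplicated) characters for a layout.

    layout_chars: every character the layout can produce, across all layers.
    required: every character the layout is expected to cover.
    """
    counts = Counter(layout_chars)
    missing = set(required) - set(counts)
    duplicated = {ch for ch, n in counts.items() if n > 1}
    return missing, duplicated

# Toy example: "!" appears twice, "<" and ">" are absent.
missing, duplicated = check_layout("abc!!", "abc<>")
```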
How can I make a layout for KLA? You can send it to me and I will modify it.
! on dot is better, you can delete it from 1, it's not needed to make it more straightforward because my layout lacks too you can simply swap - and on my layout you created in KLA.
I uploaded a playground. Your layout is the second one (click Configure at top), please fix, then export the json and send to me to replace. ian@keyboard-design.com
https://klanext.keyboard-design.com/pl/
Thanks :-)
Hey, I've made layouts for a lot of languages in the past, and coincidentally I was actually thinking today about making something for Polish! My analyzer can be found here; it's written in Rust and comes with a useful REPL to interact with it.
I'm using corpora from Leipzig Wortschatz, also mentioned by Ian earlier. I know these are not fully representative of casual texting and everyday typing, but having compared some news(crawls) and similar between different corpora for English I'm pretty confident they're very close to being representative in any case. A lot of word usage between news articles and websites ends up being the same as more casual usage of the language.
For corpus processing, I transpose everything to lowercase including punct, meaning `_` becomes `-`, `"` becomes `'`, etc., and toss out numbers and their corresponding punctuation. I also transpose some variations of certain punctuation, mostly different quotation marks, to their ASCII version. In this step I also tag on an accent key (denoted with `*`), with the following functionality:
*a -> ą
*o -> ó
*z -> ź
*s -> ś
*c -> ć
*n -> ń
Seems pretty self-explanatory. For the eventual layout, you can implement these with a dead key. You might notice `ł`, `ę` and `ż` are missing, however, and you would be right, as those get their own dedicated key on the keyboard, courtesy of them being a lot more common than `q`, `v` and `x`. For punctuation I use `.`, `,` and `'`, which gives us the following 30 keys to use for keyboard layout generation:
a b c d e f g h i j k l m n o p ł r s t u ę w ż y z ' , . *
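The preprocessing described above could be sketched like this. It is not the analyzer's actual code (that is written in Rust); the names are hypothetical, the set of typographic quotes is an assumption, and only digits are dropped here, not the punctuation that goes with them.

```python
# Accent-key expansions as listed in the post: "*a" stands for "ą", and so on.
ACCENTED = {"ą": "*a", "ó": "*o", "ź": "*z", "ś": "*s", "ć": "*c", "ń": "*n"}
# Shifted punctuation folded to its unshifted key; only the two pairs named in the post.
UNSHIFT = {"_": "-", '"': "'"}
# Typographic quotes mapped to ASCII first; this particular set is an assumption.
QUOTES = {"\u201e": '"', "\u201d": '"', "\u201c": '"', "\u2019": "'"}

def preprocess(text: str) -> str:
    """Lower-case, normalise quotes, unshift punctuation, drop digits, expand accents."""
    out = []
    for ch in text.lower():
        ch = QUOTES.get(ch, ch)
        ch = UNSHIFT.get(ch, ch)
        if ch.isdigit():
            continue
        out.append(ACCENTED.get(ch, ch))
    return "".join(out)

example = preprocess("\u201eŻółć\u201d 2012")
```

`ł`, `ę` and `ż` pass through unchanged because they have dedicated keys in this scheme.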
This does denote one of the limitations of my analyzer, in that it can only optimize for the main 3x10 keys and nothing around it. In this case that is fine however, since there aren't any keys left out as it stands.
From there, I can run `generate 2500` in my analyzer, which does all the work for us! Polish seems to be a weird language in that it has a lot of keys between 3 and 1% freq, rather than having some high usage keys and then usage falling off more quickly. This meant that creating nice pinky columns got quite hard. A solution to this could be to add `ę` to the accent key, for example, and remove its dedicated key, but then you have to hit more keys in the end, which doesn't seem super ideal. It might be worth it though and is probably worth exploring.
Some of the layouts I found were:
ł t s k j f . e u '
r c n w m , z a o i
l d b p g ż * ę y h
Sfb: 1.116%
Dsfb: 7.935%
Finger Speed: 5.798
[0.382, 0.543, 0.806, 1.507, 0.578, 0.602, 0.734, 0.646]
Scissors: 0.291%
Inrolls: 23.094%
Outrolls: 23.163%
Total Rolls: 46.258%
Onehands: 1.086%
Alternates: 35.394%
Alternates (sfs): 9.812%
Total Alternates: 45.206%
Redirects: 3.277%
Bad Redirects: 0.182%
Total Redirects: 3.459%
Bad Sfbs: 0.541%,
Sft: 0.011%
This is the `rcnw` variant, which besides the relatively wonky left pinky seems pretty amazing. It has both high rolls and high alternation with very low redirects, but it's got relatively high finger speed. As far as I've seen, though, it appears to be very difficult to suppress that much further.
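For readers unfamiliar with the Sfb figure: it is the share of bigrams whose two keys are pressed by the same finger. The sketch below computes it for the 3x10 grid above under a plain column-to-finger assignment, which is an assumption (the analyzer's own fingering may differ, so this need not reproduce the reported numbers). The bigram counts are toy values, not corpus data.

```python
ROWS = [
    "ł t s k j f . e u '",
    "r c n w m , z a o i",
    "l d b p g ż * ę y h",
]
# Assumed fingering: columns 0-2 and 7-9 get one finger each,
# columns 3-4 share the left index, columns 5-6 the right index.
FINGER_BY_COLUMN = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]

def finger_map(rows):
    """Map each key to a finger number based on its column."""
    return {key: FINGER_BY_COLUMN[col]
            for row in rows for col, key in enumerate(row.split())}

def sfb_percent(rows, bigrams):
    """Percentage of bigram occurrences typed with one finger on two different keys."""
    fingers = finger_map(rows)
    total = sum(bigrams.values())
    same = 0
    for bigram, n in bigrams.items():
        a, b = bigram  # a bigram is a two-character string
        if a != b and a in fingers and b in fingers and fingers[a] == fingers[b]:
            same += n
    return 100 * same / total

# Toy counts: only "rl" (both keys in the left pinky column) is a same-finger bigram.
result = sfb_percent(ROWS, {"ie": 60, "rl": 10, "na": 30})
```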
ż t r w p f . e y '
s c n k m l z a o i
g d ł b j , * ę u h
Sfb: 1.025%
Dsfb: 8.153%
Finger Speed: 5.631
[0.261, 0.543, 0.433, 1.581, 0.831, 0.602, 0.734, 0.646]
Scissors: 0.541%
Inrolls: 26.256%
Outrolls: 21.175%
Total Rolls: 47.431%
Onehands: 1.155%
Alternates: 33.333%
Alternates (sfs): 9.484%
Total Alternates: 42.817%
Redirects: 4.611%
Bad Redirects: 0.244%
Total Redirects: 4.855%
Bad Sfbs: 0.477%,
Sft: 0.012%
This is the `scnk` variant, which has slightly higher scissors but fewer sfbs, and should be another sound option. It does have `g` on the bottom row, however, which is the case because `gd` occurs around 0.06%. You could probably move `g` to the top row and be completely fine, though.
Any thoughts? I might play around with it tomorrow. By the way, my keyboard layout playground has Polish too now, so you can play around with these (or any other layouts posted here) over there as well. Good luck y'all!
Hi. I used your bigram data and layouts and improved my layout based on them, while trying to change it as little as possible so I don't have to learn a new one from scratch again :P I also corrected the fingers assigned to a few keys for a more accurate representation of the layout. fwyr nowy.txt
@O-X-E-Y is this analyzer only for ortho keyboards?
btw it lacks ó and ź
The analyzer currently only supports 3x10, but the heatmap it uses is made for rowstag so it does optimize for that (angle mod specifically). Also `ó` and `ź` are there, that's what the accent key is for.
Pretty nice. Maybe I'll use one of your layouts? I checked mine and it's just worse, so I'm going to start the pain of learning again :P I may also check the results of your layouts in KLA so we can compare them to Ian's and find which is best.
Ian needs to redo ... the problem is that KLA does not support "magic diacritic keys" like Oxey's analyzer.... only AltGr style.
scnk in KLA scores similar to ian10, so it looks like your layouts are close to ideal. scnk
Just stepping back into this exchange after some time away (Halloween costume complete!). Is there still an interest in running the Engram protocol on excerpts from the Polish corpus to optimize Engram for Polish?
Ian and oxey optimized it close to the limit, so if it would take a lot of effort it's not necessary.
I am wary of standard optimization criteria when it comes to evaluating comfortable rather than efficient typing, but if you are happy with it, that's great!
Is it hard to get an Engram layout for a language like Polish? I created my layout based on Engram, but maybe it can be further optimized.