How to create engram for Polish?

AKmatiAK commented 2 years ago

Is it hard to get an Engram layout for a language like Polish? I created my layout based on Engram, but maybe it can be further optimized.

ukladpl ukladplAltGr

AKmatiAK commented 2 years ago

I updated my layout, here it is:

FWYR2 altgr layer remains the same

binarybottle commented 2 years ago

Please take a look at the optimized layout for Spanish (https://github.com/binarybottle/engram-es) that we created.

If there is very accurate and representative 1-gram and bigram frequency data for the Polish language (including symbols), then we could apply a modified version of the code to generate an Engram layout optimized for Polish.

AKmatiAK commented 2 years ago

Hi. I contacted the Polish corpus creators and got data up to 5-grams. It's available here: n-grams pl data. Do I need to process it further, or is it enough?

Also, here is a list of available resources that might be useful: link

binarybottle commented 2 years ago

This corpus looks pretty official! I like that it has a broad variety of book and news sources. Too bad it doesn't include spoken transcripts or social media sources. Anyway, I would be happy to help with this but it will be a couple of months before I can get to it -- buried with projects right now.

AKmatiAK commented 2 years ago

Ok, thank u for help ;)

binarybottle commented 2 years ago

@iandoug -- Given your experience helping to clean up the Spanish corpus, do you have any concerns about the proposed Polish corpus?: http://zil.ipipan.waw.pl/NKJPNGrams

iandoug commented 2 years ago

Hi Arno

For keyboard layout use, I prefer to strip texts not normally typed on computer keyboard (like spoken transcripts or tweets) because that will mess up the character frequencies and n-grams.

"Each unigram is maximum continuous chunk of non-whitespace lower-case characters."

That is the normal way of doing it. Ian of course is not normal and does it Case Sensitive ... :-) Because typing Th is different to th.

It looks like they only have "1-million-word subcorpus" available to download.

Is this typical Polish text?

Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.

Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.
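A minimal sketch of the "strip out leading dashes" idea, assuming the dialogue marker is a hyphen, en dash, or em dash at the start of a line. The function name and the exact set of dash characters are illustrative assumptions, not something taken from the corpus documentation.

```python
import re

# Assumed set of characters that can open a line of dialogue:
# ASCII hyphen, en dash (U+2013), em dash (U+2014).
_LEADING_DASH = re.compile(r"^\s*[-\u2013\u2014]\s*")

def strip_dialogue_dash(line: str) -> str:
    """Remove one leading dash and the whitespace around it; leave other dashes alone."""
    return _LEADING_DASH.sub("", line, count=1)

sample = [
    "\u2014 Masz ty duszę?",   # em dash opening a line of dialogue
    "- Powiedz!",              # plain hyphen used the same way
    "Tak jest - od ręki.",     # dash inside the sentence stays
]
cleaned = [strip_dialogue_dash(s) for s in sample]
for line in cleaned:
    print(line)
```

Only the leading marker is dropped, so dashes used inside a sentence still count towards the character frequencies.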

I have not seen any HTML etc, but since this is "manually annotated", I guess it is "clean" in that regard. There are some ALL CAPS sentences.

Will see if I can extract the text and do some analysis over the weekend. We are currently having rolling blackouts so that messes up plans.

iandoug commented 2 years ago

For future reference: Leipzig

https://wortschatz.uni-leipzig.de/en/download/Polish#pol_newscrawl_2011

iandoug commented 2 years ago

What's the difference in quotes? Should both be on keyboard?

która znalazła się w zestawieniu "Billboard Magazine".

” 2012, a zespół otrzymał nominację do nagród „Songlines Music Awards” 2012 w kategorii „Best Group”. ” (2012) oraz realizator dźwięku przy filmie „

AKmatiAK commented 2 years ago

Do lines starting with a dash indicate dialogue? There seems to be a lot of that, which is going to bump up the dash frequency. Should maybe strip out leading dashes.

Yes, they mark dialogue, but they are rarely used outside of books.

What's the difference in quotes? Should both be on keyboard?

Bottom quotes are rarely used and shouldn't be on the keyboard (they are now superseded by using upper quotes on both sides, and the meaning is the same).

Is this typical Polish text?

Masz ty duszę? Powiedz! Tak jest. Od ręki. To chyba dobra formuła.

2,3 typical, 1 is correct but rather from books

binarybottle commented 2 years ago

Thank you for taking a look, @iandoug! I don't know Polish, so I will defer to @AKmatiAK and other Polish speakers/typists. A corpus of only 1 million words is pretty small, but I hope it represents what people type.

iandoug commented 2 years ago

I took a look at the linked corpus, not wild about it, seems to contain a lot of dialogue. Will try cleaning up some of it as next step after this.

Instead, I grabbed all the 1M files from the Leipzig Polish corpus. After looking at those, I decided to only use the "news" files; the rest is going to be a mess to clean. So that supplies 9 million sentences.

After tweaking my Spanish cleanup program, I now have a 688 MB text file to play with. I grabbed some Polish books from Gutenberg ... only a few, most seem to be poetry or dialogue-heavy novels. Will try my usual "extract some text" approach with those to add to the Leipzig file.

Current char distribution looks like this. Provisional list, may change ...

char-dist-1.txt

polishfreq1.txt

iandoug commented 2 years ago

Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)

112 characters?

binarybottle commented 2 years ago

@iandoug -- Using news files sounds reasonable, but I wouldn't throw out dialogues -- they are far closer to how people type emails than books are.

AKmatiAK commented 2 years ago

I took a further look at the NKJP n-grams and they're heavily bloated with parliament session transcripts or something like that, so they're pretty useless. News/internet is the way to go. I'll take a look at the Leipzig files.

iandoug commented 2 years ago

Sample from "Web" corpus attached.

Will do your "single-case" frequencies and bigrams in due course.

web-sample.txt

The dialogues all like this:

% short sentence 1. % short sentence 2. % short sentence 3.

where % is the - character. Markdown getting in the way again.

AKmatiAK commented 2 years ago

idk how to read n-grams from leipzig. Is there any instruction for this?

Gonna have to do this on ISO not ANSI ... need that extra key. Even then, going to be challenging. :-)

112 characters?

~~We use ISO keyboard, same as in US here.~~ Both ISO and ANSI. Polish characters on altgr. 112 characters without space and enter

iandoug commented 2 years ago

First attempt at bigrams. Am playing with trial layout, I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Also UDHR in Polish as temporary test file. The UN no longer seems to have .txt downloads, just PDF on web page.

udhr-polish.txt bigrams-polish1.csv

csv is tab-separated.

Most common: ie ni na ow st ze cz rz po ch an ra pr wi zy ro ia za wa ta dz sz od ki en ko ar ej mi li ci zi ac
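For anyone who wants to reproduce a list like this from their own text, here is a minimal sketch of single-case bigram counting. It assumes bigrams are counted only between letters inside whitespace-separated words; the function name and the sample sentence are made up for illustration, and this is not Ian's actual program.

```python
from collections import Counter

def count_bigrams(text: str) -> Counter:
    """Count adjacent letter pairs inside whitespace-separated words, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        for a, b in zip(word, word[1:]):
            # Skip pairs that involve punctuation or digits.
            if a.isalpha() and b.isalpha():
                counts[a + b] += 1
    return counts

sample = "Nie ma nic. Nie wiem."
bigrams = count_bigrams(sample)

# Tab-separated output, most frequent first.
for pair, n in bigrams.most_common():
    print(f"{pair}\t{n}")
```

A case-sensitive variant would simply drop the `.lower()` call, which matters if typing Th is to be treated differently from th.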

AKmatiAK commented 2 years ago

I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout? I checked which layouts the keyboards currently sold in Poland have, and no standard is followed: both ISO and ANSI are used (not like I previously wrote; I thought that ISO was the standard here).

iandoug commented 2 years ago

I see from Wikipedia that people actually prefer the ANSI boards over ISO so am using that.

Honestly, I bought one ANSI keyboard and don't like it, but it shouldn't be a problem to make two variants of the layout? I checked which layouts the keyboards currently sold in Poland have, and no standard is followed: both ISO and ANSI are used (not like I previously wrote; I thought that ISO was the standard here).

Yeah, I was surprised at what Wikipedia said about that. At the moment I have enough keys on ANSI, though must put Euro somewhere. Spanish and French more tricky because multiple diacritics per vowel. You only have 2 on Z.

Here's first attempt at chained bigrams, since the UDHR character frequency is not very good. But not happy with this file either, has too many digits. Which is a consequence of the "news" input I suppose.

polishmonkeytest.txt

iandoug commented 2 years ago

Okay finally got somewhere but it "feels" a mess, probably because I know nothing about Polish. But it will give you something to compare against.

Ignore the layouts with .en. in the name, they are missing the Polish letters so their scores are wrong.

The bottom one is the "Programmer" layout which WP says is the most common. It might make sense to put some of these letters on their own keys, instead of Q V X since these 3 are not native Polish and thus rare. Or at least ł.

ł ord 197 hex c5 8602003
ż ord 197 hex c5 4490722
ó ord 195 hex c3 4461711
ś ord 197 hex c5 2988511
ć ord 196 hex c4 2271532
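A side note on the "ord" and "hex" values listed above: they look like the first byte of each letter's UTF-8 encoding rather than its Unicode code point, which would explain why ł, ż and ś all show 197. The sketch below checks that reading; it is an interpretation of the numbers, not a description of how the list was produced.

```python
# The five letters from the list, written as escapes so the check does not
# depend on how this snippet itself is encoded: ł ż ó ś ć
letters = "\u0142\u017c\u00f3\u015b\u0107"

lead_bytes = {}
for ch in letters:
    encoded = ch.encode("utf-8")          # two bytes for each of these letters
    lead_bytes[ch] = encoded[0]           # first (lead) byte of the sequence
    print(ch, f"U+{ord(ch):04X}", "utf-8:", encoded.hex(), "lead byte:", encoded[0])
```

If a frequency counter walks the file byte by byte instead of character by character, several letters collapse onto the same lead byte, so it is worth confirming that the counts themselves were made per character.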

polish-test-1 ian pl ansi

iandoug commented 2 years ago

Enough for today. Getting better. polish-test-2 ian2 pl ansi

iandoug commented 2 years ago

Been playing around. Current best version, changes may be too dramatic for easy acceptance.

Can compare performance against default Programmer version at bottom of list. Ignore .en. layouts.

polish-test-3 ian8 pl ansi

iandoug commented 2 years ago

Think I need the diacritic S letters on separate key, which means switching to ISO form factor.

iandoug commented 2 years ago

Hand balance is 58:42, but can't find spot on right for popular letters on left ...
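As a rough illustration of what a hand-balance figure like this measures, here is a sketch that counts how many letters of a text fall on each hand. The left-hand key set is the QWERTY left half and is purely a placeholder; any letter not in that set (including AltGr letters such as ł) is counted as right-hand, which is an assumption of this sketch and not how KLA computes its score.

```python
# Placeholder assignment: left-hand letters of a QWERTY-style layout.
LEFT_KEYS = set("qwertasdfgzxcvb")

def hand_counts(text: str, left_keys=LEFT_KEYS):
    """Return (left, right) letter counts; non-letters are ignored."""
    left = right = 0
    for ch in text.lower():
        if not ch.isalpha():
            continue
        if ch in left_keys:
            left += 1
        else:
            right += 1
    return left, right

left, right = hand_counts("To chyba dobra formuła.")
total = left + right
print(f"left {100 * left / total:.0f} : right {100 * right / total:.0f}")
```

Swapping `LEFT_KEYS` for the left half of a candidate layout and feeding in a larger corpus gives a quick sanity check before running a full analyzer.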

AKmatiAK commented 2 years ago

This is my current layout. I have been developing it for about a month by simple intuition, applying fixes based on what I thought should be changed, so it might be useful to some extent in designing engram-pl. I know it lacks some keys, because I changed it frequently. In my subjective opinion, the cie trigram is very frequent and should be placed on the keyboard (but I may be wrong). Also, mixing different letters on one key is not a very good idea IMO; it might be faster but it is unintuitive. Only ź should be placed on another letter. Placing ł on i instead of L is also reasonable, because I found it easy to remember somehow.

btw: what I like a lot in ISO is far better thumb access to AltGr and one more letter on the home row. I couldn't achieve that on ANSI, and because of that I stuck with my old ISO one. keyboard-layout(2)

of course I have caps and ctrl swapped ;)

iandoug commented 2 years ago

Mmm so of course you would use a form factor that is not in KLA ..... neither ANSI nor ISO :-)

sz is a common bigram so should not be on same finger.
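A minimal sketch of the same-finger check behind this remark. The grid uses QWERTY rows and a conventional column-to-finger mapping purely as placeholders (on that grid s and z happen to be on different fingers); to test a candidate layout, replace `ROWS` with its rows. The bigram counts are made up.

```python
# Placeholder 3x10 grid; substitute the rows of the layout being tested.
ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
# Assumed finger per column: L/R hand, P=pinky, R=ring, M=middle, I=index.
COLUMN_FINGER = ["LP", "LR", "LM", "LI", "LI", "RI", "RI", "RM", "RR", "RP"]

FINGER = {ch: COLUMN_FINGER[col] for row in ROWS for col, ch in enumerate(row)}

def same_finger(a: str, b: str) -> bool:
    """True when two different keys are struck by the same finger."""
    return a != b and a in FINGER and b in FINGER and FINGER[a] == FINGER[b]

def sfb_share(bigram_counts: dict) -> float:
    """Fraction of all counted bigrams that land on a single finger."""
    total = sum(bigram_counts.values())
    hits = sum(n for pair, n in bigram_counts.items() if same_finger(pair[0], pair[1]))
    return hits / total if total else 0.0

counts = {"sz": 5, "ed": 1, "de": 2, "ni": 4}  # made-up counts for illustration
share = sfb_share(counts)
print(f"same-finger share: {share:.2%}")
```

Repeated letters are excluded on purpose, since pressing the same key twice is usually not treated as a same-finger bigram.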

Q V X are not in your alphabet so it makes no sense to waste whole keys on them. They are only there because of QWERTY.

I made an ISO version, realised I had the spacebar on the wrong thumb, so had to basically mirror the layout to fix it.

Hand balance is nearly perfect now. ANSI version slightly better, but ISO puts the space bars further away and there's nothing I can do about that. Other metrics are better. ian10 pl iso ian10 pl ansi polish-test-4

I may have used the wrong input file to create the chained bigrams, so redid it.

polishmonkeytest2.txt

iandoug commented 2 years ago

The Q X V can be put in better places ... first get the Polish to work :-)

binarybottle commented 2 years ago

@iandoug -- Thank you for hitting this hard over the weekend! I am slammed this week but hope to take a look at what you're doing next weekend.

iandoug commented 2 years ago

Was not intending to but once you start fiddling with layouts ... like a drug :-)

Also have other stuff to do this week, will try to improve corpus when I have time.

AKmatiAK commented 2 years ago

sz is a common bigram so should not be on same finger.

my right fingers position is ATZS + altgr edit: left hand JCIE + space. this way I use pinky only for ctrl (on caps) and my hand position is more straight. and right pinky for enter

Also why ó and ł are on such strange positions? They are typing distance optimized too?

iandoug commented 2 years ago

Is this correct? I added the pipe character "|" back. akmatiak pl iso

iandoug commented 2 years ago

Also why ó and ł are on such strange positions? They are typing distance optimized too?

Your accented characters seem to be almost treated as separate letters rather than "stressed" versions of the version without the diacritic. Certainly judging by the frequency of some of them.

Those letters are where they need to be so that the layout scores well. Your layout scores better than the default, but could do a lot better. Will send screenshot if above layout is correct.

iandoug commented 2 years ago

I joined the few books I had together and cleaned up the unwanted characters. The file is 1,116,925 bytes.

The character frequency came out as iaeoznsrcwymtdkpł,ujl.bęgąhżśó-ć!ńPWAO;NfTZD?:"IźSRKC_JMBG*L10EU[]Ł824'F5HŻ736v)Ś9VY(xXq=/ÓQĘĄŃŹĆ{~&^`

while the frequency for the Leipzig "news" files is

aieonzrwstyckdpmujlł.bg,ęhążóś⮠PćfWS-0"KńMN1TA2ZBDORJCIG:LE53U4)(ź9F?687H!VvŚŁ/ŻxX'Y%;q+Q&ĘŹ@`ĆÓĄ*>~][$Ń<_=€#|^}{

The book's order for the most frequent characters is different, probably a consequence of using the main character's name a lot. I normally just take short extracts of books to avoid this, but don't have enough to do that (besides having somehow lost the program I wrote to do that).

So don't think I will include these texts. Will see what I can get out of the "official" corpus posted above.

The problem with that corpus is that it is intended for "parts of speech" analysis, not "what do people type on keyboards" like we need.
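One way to put a number on how far the two frequency orderings above differ is to compare the rank of each character in both lists. The sketch below uses only the leading letters of the two strings quoted in this comment; the function name is hypothetical.

```python
def rank_shifts(order_a: str, order_b: str, top: int = 10) -> dict:
    """For the first `top` characters of order_a, report rank in order_b minus rank in order_a."""
    pos_b = {ch: i for i, ch in enumerate(order_b)}
    return {ch: pos_b[ch] - i for i, ch in enumerate(order_a[:top]) if ch in pos_b}

# Leading letters of the two orderings quoted above.
books = "iaeoznsrcwymtdkpł"
news = "aieonzrwstyckdpmujlł"

for ch, shift in rank_shifts(books, news, top=10).items():
    print(ch, f"{shift:+d}")
```

A positive value means the character is ranked lower (less frequent) in the news files than in the books; large shifts near the top of the list are the ones that would change a layout.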

AKmatiAK commented 2 years ago

Is this correct? I added the pipe character "|" back.

Will send screenshot if above layout is correct.

only difference is minus sign on different key, in place where middle dot was, but this doesn't matter. my layout also has problem with lack of greater/equal. for score this shouldn't make difference

ah I forgot, there isn't "!" near N

You can put greater/equal on the slashes; the only reason I didn't do this is that I want to make these keys modifiers in the future, and I didn't want to memorize those symbols there.

AKmatiAK commented 2 years ago

Here's another letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072

iandoug commented 2 years ago

Here's another letter frequency data: https://sjp.pwn.pl/poradnia/haslo/;7072

That's probably both cases merged ... the order of the most frequent is same as mine up to around c / y. I will make unicase list, also bigrams, for Arno. Thanks.... gives me confidence in my corpus.

iandoug commented 2 years ago

ah I forgot, there isn't "!" near N

There were two !, I removed the one on the letter keys and left the one on 1 as "standard". I had trouble with - and _ , the font on your keyboard is not so clear. Where should minus be? I did not see middle dot ...

I need to write a checker program to check layouts for all needed characters and no duplicates. Will add < and > to yours.
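A minimal sketch of such a checker, assuming the layout has already been flattened into a plain sequence of the characters it can produce (across all layers). It does not read KLA's JSON export; the function name and the sample data are made up.

```python
from collections import Counter

def check_layout(layout_chars, required):
    """Return (missing, duplicates): required characters absent from the layout,
    and characters that appear on more than one key position."""
    counts = Counter(layout_chars)
    missing = sorted(set(required) - set(counts))
    duplicates = sorted(ch for ch, n in counts.items() if n > 1)
    return missing, duplicates

# Made-up example: "!" assigned twice, and ć, < and > not assigned at all.
layout_chars = list("aąbc!!")
required = "aąbcć!<>"

missing, duplicates = check_layout(layout_chars, required)
print("missing:", missing)
print("duplicates:", duplicates)
```

In practice `required` could be built from the corpus character frequency list, so that every character above some frequency threshold must appear exactly once.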

AKmatiAK commented 2 years ago

How can I make layout for KLA? You can send me it and I will modify.

! on dot is better; you can delete it from 1. It's not needed to make it more straightforward, because my layout lacks too. You can simply swap - and _ on my layout you created in KLA.

iandoug commented 2 years ago

I uploaded a playground. Your layout is the second one (click Configure at top), please fix, then export the json and send to me to replace. ian@keyboard-design.com

https://klanext.keyboard-design.com/pl/

Thanks :-)

o-x-e-y commented 2 years ago

Hey I've made layouts for a lot of languages in the past, and coincidentally I was actually thinking today about making something for Polish! My analyzer can be found here, it's written in rust and comes with a useful repl to interact with it.

I'm using corpora from Leipzig Wortschatz, also mentioned by Ian earlier. I know these are not fully representative of casual texting and everyday typing, but having compared some news(crawls) and similar between different corpora for English I'm pretty confident they're very close to being representative in any case. A lot of word usage between news articles and websites ends up being the same as more casual usage of the language.

For corpus processing, I transpose everything to lowercase including punct, meaning _ becomes -, " becomes ', etc and toss out numbers and their corresponding punctuation. I also transpose some variations of certain punctuation, mostly different quotation marks, to their ascii version. In this step I also tag on an accent key (denoted with *), with the following functionality:

*a -> ą
*o -> ó
*z -> ź
*s -> ś
*c -> ć
*n -> ń

Seems pretty self-explanatory. For the eventual layout, you can implement these with a dead key. You might notice ł, ę and ż are missing however and you would be right, as those get their own dedicated key on the keyboard, courtesy of them being a lot more common than q, v and x. For punctuation I use ., , and ', which gives us the following 30 keys to use for keyboard layout generation:

a b c d e f g h i j k l m n o p ł r s t u ę w ż y z ' , . *

This does denote one of the limitations of my analyzer, in that it can only optimize for the main 3x10 keys and nothing around it. In this case that is fine however, since there aren't any keys left out as it stands.
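To make the corpus step concrete, here is a rough sketch of the transposition described above: lowercase everything, fold the two shifted punctuation pairs that were named, rewrite the six dead-key letters as accent key plus base letter, and drop whatever is not on the 30 keys. It is written in Python for brevity and is not the analyzer's actual Rust code; the names, the NFC normalisation step, and the choice to keep spaces are assumptions.

```python
import unicodedata

ACCENT_KEY = "*"
# Letters typed via the accent key, as listed above.
DEAD_KEY_LETTERS = {"ą": "a", "ó": "o", "ź": "z", "ś": "s", "ć": "c", "ń": "n"}
# Only the two shifted-punctuation pairs named in the comment.
SHIFT_PUNCT = {"_": "-", '"': "'"}
# The 29 directly typed keys from the list above, plus space (kept as a separator).
DIRECT_KEYS = set("abcdefghijklmnoprstuwyzłęż',. ")

def transpose(text: str) -> str:
    out = []
    # Normalise first so accented letters are single code points.
    for ch in unicodedata.normalize("NFC", text).lower():
        ch = SHIFT_PUNCT.get(ch, ch)
        if ch in DEAD_KEY_LETTERS:
            out.append(ACCENT_KEY + DEAD_KEY_LETTERS[ch])
        elif ch in DIRECT_KEYS:
            out.append(ch)
        # Everything else (digits, q, v, x, other punctuation) is dropped.
    return "".join(out)

result = transpose("Źródło: ŁÓDŹ 2012.")
print(result)
```

Counting bigrams on the transposed text then treats the accent key like any other key, which is what lets the generator place it alongside the letters.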

From there, I can run generate 2500 in my analyzer, which does all the work for us! Polish seems to be a weird language in that it has a lot of keys between 3 and 1% freq, rather than having some high usage keys and then usage falling off more quickly. This meant that creating nice pinky columns got quite hard. A solution to this could be to add ę to the accent key for example and remove its dedicated key, but then you have to hit more keys in the end which doesn't seem super ideal. It might be worth it though and is probably worth exploring.

Some of the layouts I found were:

ł t s k j  f . e u '
r c n w m  , z a o i
l d b p g  ż * ę y h
Sfb:  1.116%
Dsfb: 7.935%
Finger Speed: 5.798
    [0.382, 0.543, 0.806, 1.507, 0.578, 0.602, 0.734, 0.646]
Scissors: 0.291%

Inrolls: 23.094%
Outrolls: 23.163%
Total Rolls: 46.258%
Onehands: 1.086%

Alternates: 35.394%
Alternates (sfs): 9.812%
Total Alternates: 45.206%

Redirects: 3.277%
Bad Redirects: 0.182%
Total Redirects: 3.459%

Bad Sfbs: 0.541%,
Sft: 0.011%

This is the rcnw variant, which, besides the relatively wonky left pinky, seems pretty amazing. It has both high rolls and high alternation with very low redirects, but it's got relatively high finger speed. As far as I've seen, though, it appears to be very difficult to suppress that much further.

ż t r w p  f . e y '
s c n k m  l z a o i
g d ł b j  , * ę u h
Sfb:  1.025%
Dsfb: 8.153%
Finger Speed: 5.631
    [0.261, 0.543, 0.433, 1.581, 0.831, 0.602, 0.734, 0.646]
Scissors: 0.541%

Inrolls: 26.256%
Outrolls: 21.175%
Total Rolls: 47.431%
Onehands: 1.155%

Alternates: 33.333%
Alternates (sfs): 9.484%
Total Alternates: 42.817%

Redirects: 4.611%
Bad Redirects: 0.244%
Total Redirects: 4.855%

Bad Sfbs: 0.477%,
Sft: 0.012%

This is the scnk variant, which has slightly higher scissors but fewer sfbs, and should be another sound option. It does have g on the bottom row, however, which is the case because gd occurs around 0.06% of the time. You could probably move g to the top row and be completely fine, though.

Any thoughts? I might play around with it tomorrow. By the way, my keyboard layout playground has Polish too now, so you can play around with these (or any other layouts posted here) over there as well. Good luck yall!

AKmatiAK commented 2 years ago

Hi. I used your bigram data and layouts and improved my layout based on them, while trying to change it as little as possible so that I don't have to learn a new one from scratch again :P I also corrected the finger assignments for a few keys so the layout is represented more accurately. fwyr nowy.txt

AKmatiAK commented 2 years ago

@O-X-E-Y is this analyzer only for ortho keyboards?

btw it lacks ó and ź

o-x-e-y commented 2 years ago

The analyzer currently only supports 3x10, but the heatmap it uses is made for rowstag so it does optimize for that (angle mod specifically). Also ó and ź are there, that's what the accent key is for

AKmatiAK commented 2 years ago

Pretty nice. Maybe I'll use one of your layouts? I checked mine and it's just worse, so I'm going to start the pain of learning again :P I may also check the results of your layouts in KLA so we can compare them to Ian's and find which is best.

iandoug commented 2 years ago

Ian needs to redo ... the problem is that KLA does not support "magic diacritic keys" like Oxey's analyzer.... only AltGr style.

AKmatiAK commented 2 years ago

scnk in KLA scores similar to ian10, so it looks like your layouts are close to ideal. scnk

binarybottle commented 2 years ago

Just stepping back into this exchange after some time away (Halloween costume complete!). Is there still an interest in running the Engram protocol on excerpts from the Polish corpus to optimize Engram for Polish?

AKmatiAK commented 2 years ago

Ian and oxey optimized it close to the limit, so if it would take a lot of effort it's not necessary.

binarybottle commented 2 years ago

I am wary of standard optimization criteria when it comes to evaluating comfortable rather than efficient typing, but if you are happy with it, that's great!

binarybottle / engram

How to create engram for Polish? #46