A common "international" base layout for English/German + Spanish/Portuguese/French

leogama commented 1 year ago

Hello there. I've been an enthusiastic user of the (programmer's) Dvorak layout for almost a decade now, and learning it was a huge improvement over good ol' QWERTY. However, while it is widespread and readily available on most current systems, its performance for English is sub-optimal. Also, its variations for languages with similar alphabets, like my dear Portuguese, are still "super-terrible" (a bit less terrible than QWERTY thanks to the vowels on the left home row).

The elephant in the room

I took a look at some of these newer designs, including yours. Congratulations, by the way! Amazing work. But the OP touched on a very important point that is still unaddressed by all of these: we live in an international, interconnected world now. Until the early 2000s, it wasn't a problem to have totally different keyboard layouts for every language. We even used different, incompatible text encodings! But now the most used encoding in both new devices and the Internet is Unicode. I believe the same transition should happen to keyboard layouts.

But is there a need for it? Well, most professionals who type a lot (journalists, academics, programmers, etc.) will need to either create content in more than one language, usually their native one and English, or at least communicate often with foreigners through text. This applies even to countries that have English as their primary language, like the US, where more and more people speak Spanish as a primary or secondary language each year (> 50 million today).

Is an "international" keyboard layout possible?

I know that many languages use completely different alphabets and, even when they use similar ones (like variations of the Latin or Cyrillic scripts), they have extra characters and wildly varying letter/n-gram frequencies. Therefore, there can't be a truly international base layout for keyboards. But can we do better?

Starting from English, the de facto international language, a non-monolingual layout can't be much distant from ASCII. Looking at the languages with most speakers in the world that use a Latin script alphabet, we have in the top positions (Wikipedia/Ethnologue 2022):

Position	Language	Family	Branch	1st language	2nd language	Total speakers
1	English	Indo-European	Germanic	372.9 million	1.080 billion	1.452 billion
4	Spanish	Indo-European	Romance	474.7 million	73.6 million	548.3 million
5	French	Indo-European	Romance	79.9 million	194.2 million	274.1 million
9	Portuguese	Indo-European	Romance	232.4 million	25.2 million	257.7 million
12	German	Indo-European	Germanic	75.6 million	59.1 million	134.6 million

I think it would be feasible to analyse these 5 languages, from two branches of the same language family (you already did it for two of them), and find a design that is awesome for one (likely English) but doesn't suck for any of the others.

A "Latin" or "Romance-Germanic" base keyboard layout

For whoever is interested, I propose the development of a base layout using the Latin alphabet that is optimized for all of these 5 languages. It wouldn't be a simple weighted optimization though. What I would expect to achieve with this design is:

To have a common base for creating a new layout for each of the 5 languages;
It must be really good at English, at least as good as other current designs by the same metrics;
It should be reasonably good for the other 4 languages, but must not be terrible for any of them;
The differences between the layouts should be minimal, so that one can constantly switch between layouts without hassle, create a custom hybrid bilingual layout, or not even need to switch at all.

Steps necessary to achieve these goals:

Obtain a text corpus and n-gram frequency for French, German and Portuguese;
Find the similarities between the 5 languages using some kind of distance measure(s);
Define optimization weights for them considering these similarities, number of speakers, etc.;
Develop a method for searching the layout space by optimizing primarily for English and secondarily for the 4 other languages (using the weights), with a penalty if the layout starts becoming too bad for any single language mid-search;
Choose a winner base layout and then search for full layouts for each individual language, positioning specific keys and maybe repositioning some punctuation keys in the process.
Profit. 😎
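
The weighting and penalty steps above could be scored roughly as follows. This is a minimal sketch under stated assumptions, not the Engram method: the speaker totals come from the table above, while `language_weights`, `combined_score`, the 0.5 share for English, the 0.6 floor, the penalty factor, and the example scores are all hypothetical.

```python
# Total speakers in millions, copied from the table above.
SPEAKERS = {
    "English": 1452.0,
    "Spanish": 548.3,
    "French": 274.1,
    "Portuguese": 257.7,
    "German": 134.6,
}


def language_weights(speakers, primary="English", primary_share=0.5):
    """Give the primary language a fixed share (an assumption) and split
    the remainder among the other languages in proportion to speakers."""
    others = {lang: n for lang, n in speakers.items() if lang != primary}
    total = sum(others.values())
    weights = {lang: (1.0 - primary_share) * n / total for lang, n in others.items()}
    weights[primary] = primary_share
    return weights


def combined_score(scores, weights, floor=0.6, penalty=10.0):
    """Weighted mean of per-language scores (0 = worst, 1 = best), minus a
    penalty for every language whose score falls below `floor`."""
    base = sum(weights[lang] * scores[lang] for lang in weights)
    shortfall = sum(max(0.0, floor - scores[lang]) for lang in weights)
    return base - penalty * shortfall


weights = language_weights(SPEAKERS)

# Made-up scores for two candidate layouts, for illustration only.
balanced = {"English": 0.90, "Spanish": 0.75, "French": 0.75,
            "Portuguese": 0.75, "German": 0.75}
lopsided = {"English": 0.95, "Spanish": 0.80, "French": 0.80,
            "Portuguese": 0.80, "German": 0.40}

print(combined_score(balanced, weights))
print(combined_score(lopsided, weights))
```

With the penalty switched off (`penalty=0.0`), the lopsided candidate would win on the weighted mean alone; the floor is what keeps a layout that is terrible for one language from being selected.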

Advantages

Beyond the obvious advantages for multilingual typists, this base layout and its derivatives would benefit from having a unified, larger user base —likely very small in the beginning, but it's plausible to reach a critical size eventually.
Its software implementations could have common codebases, following the pattern of a base layout file (either the English layout or just the base itself) and modifications of it. Would be easier to maintain and port to different systems.
Being multilingual could be an eye-catching feature for anyone looking for a better layout to learn beyond QWERTY/Dvorak.
The new methods developed could be useful for custom/personal layout creation and also for other language subfamilies, like those that use the Cyrillic script.

I'm seriously considering learning a new keyboard layout once more, but it would have to be a killer layout. It would have to be one to rule them all.

I am willing to dedicate some time to this idea if there are others interested. If not, maybe I'll end up trying to create my own Portuguese or Portuguese-English Engram layout.

Greetings from Brazil! 🇧🇷

Originally posted by @leogama in https://github.com/binarybottle/engram-es/issues/40#issuecomment-1526811911

leogama commented 1 year ago

@binarybottle, feel free to move this issue to binarybottle/engram if you think it's appropriate (there would be more people to discuss).

binarybottle commented 1 year ago

@leogama -- Thank you very much for the suggestion! I like the idea of a Latin-script, Esperanto-esque optimized key layout. I'm not sure about calling it "Latin" because it wouldn't be optimized for the Latin language, but other possible names include "esp" (for "Esperanto", reading minds, or "especial"), "indoeuro" or "ie", or "ESF" from "English-Spanish-French".

For such an undertaking, we would need a representative corpus combining all three languages, and I know of no one better suited to the challenge than @iandoug. Ian -- what do you think???

iandoug commented 1 year ago

FWIW, I have been busy for a while with a similar project. It started with a desire to support our 11 official languages, but we have so many other people from north of us here as well, as well as moves to introduce Swahili as a language in schools, that I thought I might as well look at the wider Southern-African region.

The relevance here is that Afrikaans already uses a lot of diacritics. We have German speakers in Namibia, Portuguese in Angola and Mozambique, and one country speaking Spanish.

Afrikaans is close to Dutch. West Africa is French.

So I added all the diacritics to my Poqtea layout, and tested it against a small corpus with all the languages (using the Universal Declaration of Human Rights available from Unicode).

Poqtea does well with all of the languages except one, where it is average (rather than good or terrible).

But the corpus is not very good, and needs to be bigger.

I have already collected the files for various languages from the Uni Leipzig site; I need to clean them and get the character and bigram frequencies.
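
Getting character and bigram frequencies from a cleaned text file can be sketched as below. This is only an illustration, not the tooling actually used here: `ngram_frequencies` is a hypothetical helper, and it assumes n-grams are counted inside words only, after Unicode NFC normalization and lowercasing.

```python
from collections import Counter
import unicodedata


def ngram_frequencies(text, n=1):
    """Relative frequencies of letter n-grams, counted within words only."""
    # NFC keeps a letter and its diacritic together as one code point where possible.
    text = unicodedata.normalize("NFC", text).lower()
    counts = Counter()
    word = []
    for ch in text + " ":  # the trailing space flushes the last word
        if ch.isalpha():
            word.append(ch)
            continue
        # A non-letter ends the current word: count its n-grams, then reset.
        for i in range(len(word) - n + 1):
            counts["".join(word[i:i + n])] += 1
        word = []
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


letters = ngram_frequencies("Ação, ação!", n=1)
bigrams = ngram_frequencies("Ação, ação!", n=2)
```

Plain `str.lower()` is not locale-aware, so Turkish dotted and dotless i would need separate handling.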

Let me see what I have in that regard.

Attached layout is current version of design. It allows typing in a multitude of Southern African languages, as well as (at least) the following Euro languages:

English, (Afrikaans), Portuguese, Spanish, French, Italian, German, Dutch, Turkish.

The design supports dead keys via the "compose" function, but diacritic letters for these languages are also directly typeable via the Blue (a renamed AltGr) and Green (new) modifiers.

The main character area is 48 keys as per ISO boards, so they could be put on ISO. Just need to repurpose some of the useless Windows keys to be Compose and Green.

I realise this is not Engram, but it provides something to measure against.

For Poster's needs, we can remove the diacritics used in the African languages like ṋ, ṱ, š, etc.

[Image: janiso-gen-α-slab-v14-caps-main]

iandoug commented 1 year ago

French project: https://en.wikipedia.org/wiki/AZERTY#/media/File:Azerty_NFZ71-300.png https://norme-azerty.fr/en/
Scroll down to the Documents section at the bottom to see how they want to handle Greek, currencies, etc. They included Bitcoin, which I don't think will survive the rise of Central Bank digital currencies (they will simply outlaw all non-official ones), but missed the Thai Baht.

German project: https://en.wikipedia.org/wiki/German_keyboard_layout https://de.wikipedia.org/wiki/DIN_2137

But these are basically tweaks to AZERTY and QWERTZ.

US International: https://en.wikipedia.org/wiki/QWERTY#US-international

Proposal for Italian: https://www.farah.cl/DistribucionesDeTeclado/NuovItal/index_en.html

ADNW for German and more. http://www.adnw.de/index.php?n=Main.HomePage

Letters: My research so far shows that you would need to support this:

Latin alphabet A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

Indo-Arabic digits 0 1 2 3 4 5 6 7 8 9

Other general punctuation and symbols ! @ # & * ^ % ( ) { } [ ] - _ / ? \ | ' " , < . > ; : = + ~ `

English á à ç é è ê ë ï ñ ô ö æ œ Á À Ç É È Ê Ë Ï Ñ Ô Ö Æ Œ

Portuguese á â ã à ç é ê è í ì ï ó ô õ ò ú ù ü Á Â Ã À Ç É Ê È Í Ì Ï Ó Ô Õ Ò Ú Ù Ü

French é à è ù â ê î ô û ë ï ü ÿ ç É À È Ù Â Ê Î Ô Û Ë Ï Ü Ÿ Ç

German ä ö ü ß Ä Ö Ü ẞ

Italian é ó à è ì ò ù î É Ó À È Ì Ò Ù Î

Turkish ç ğ ı ş Ç Ğ İ Ş

Dutch á â ä é è ê ë í ï ó ô ö ú û ü ĳ ȷ Á Â Ä É È Ê Ë Í Ï Ó Ô Ö Ú Û Ü Ĳ

Other characters which appear on various European keyboards: ñ Ñ ç Ç £ € ‹ › « » ª º § ¿ ¡ ¬ ° μ ¤

Assorted combining diacritics.

So it's a lot.

iandoug commented 1 year ago

EURkey: https://eurkey.steffen.bruentjen.eu/

binarybottle commented 1 year ago

@iandoug -- Amazing work!!! Thank you for sharing your progress.

1. If you have a suitable text corpus for African languages, then I would love to work with you on an engram variant to support these!
2. Would you be able to help with a corpus for English-Spanish-French-Portuguese?
3. Thank you for listing all of the letters, numerals, and symbols used in European typewriting. Do you know what the intersection would be for just English, Spanish, and French?

iandoug commented 1 year ago

Re (1), Africa is huge with a multitude of languages, not all of which use the Latin alphabet.

Nigeria tried to create a pan-Nigerian layout: https://en.wikipedia.org/wiki/Pan-Nigerian_alphabet

Basically hacked QWERTY, not optimised at all.

A few years back I worked with Hugh from SIL on two projects, one of which was a keyboard for one African language. They use tone marks etc, which are more frequent than most letters.

https://github.com/HughP/dnj-corpus/issues/20 https://en.wikipedia.org/wiki/Dan_language

And that's just one language. Pan-African support is not viable, which is why I limited myself to "most" Southern African languages (basically from around the equator south) that use the Latin alphabet.

Down here, there is another twist: Bantu and KhoiSan languages use an assortment of click sounds, which are not on the keyboard (they are in IPA). https://en.wikipedia.org/wiki/Khoisan_languages

If you go north you hit Arabic. Go east, hit Ethiopic. https://en.wikipedia.org/wiki/Tigre_language

So Pan-African non-trivial.

Re (2), Yes, need to make wider one for my own needs. Will weight the parts by number of L1 and L2 speakers.

Re (3), Will make intersection.

Will also scan font collection to see what font support is like.

Cheers, Ian

iandoug commented 1 year ago

BTW did we do something with Russian? Can't find the link but I did get a corpus, and recently discovered some errors (handling Unicode properly...)

Might even be similar issues with the Spanish and Polish. Will relook at it.

binarybottle commented 1 year ago

@iandoug --

The variety of African languages (and second languages) sounds rich and complex. Do you think it makes more sense to tackle each language individually, or are there enough 1st- and 2nd-language writers that it makes sense to optimize one or more hybrid layouts? You mentioned Swahili -- would that be a reasonable 2nd-language candidate?
Which Latin-script languages do you think a single layout should support? English-Spanish-French-Portuguese, or more? I am concerned about weighing things down with diacriticals (German adds quite a bit on top, doesn't it?). Weighting letter and bigram frequencies by number of speakers makes sense.
Multiple intersections would be better, based on an increasing number of included languages -- for example, English-Spanish-French-Portuguese, English-Spanish-French-Portuguese-German,...

iandoug commented 1 year ago

German diacritics are few compared to Dutch/French/Portuguese.

I had a comment on Reddit that EURkey, mentioned above, covers Western Europe rather than Eastern/Northern Europe (those using Latin script).

That may suggest a logical split. The new German layouts are attempting to cater to the languages around them, including things like Sami.

There are so many diacritics though. Even adding them all as dead keys, and labelling the caps, leads to a cluttered keyboard. I know my Janiso layout is also very cluttered, that's partly because I was trying to avoid dead keys (or Linux Combine method) and instead place all the needed diacritic versions on a layer.

Even our current US/UK layouts do not support English fully. Adding the missing bits

á à ç é è ê ë ï ñ ô ö æ œ Á À Ç É È Ê Ë Ï Ñ Ô Ö Æ Œ

takes you a long way to supporting French, Dutch, Spanish and German. Portuguese is another level up.
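
One way to check how far the English extras go is to compare letter sets directly. The sketch below is an illustration only: it uses the lowercase letters exactly as listed earlier in this thread (Spanish is left out because no Spanish list was posted), and `letter_set` and `still_missing` are hypothetical helper names.

```python
import unicodedata

# Lowercase diacritic letters per language, copied from the lists earlier in this thread.
LETTERS = {
    "English": "á à ç é è ê ë ï ñ ô ö æ œ",
    "French": "é à è ù â ê î ô û ë ï ü ÿ ç",
    "German": "ä ö ü ß",
    "Portuguese": "á â ã à ç é ê è í ì ï ó ô õ ò ú ù ü",
    "Dutch": "á â ä é è ê ë í ï ó ô ö ú û ü ĳ ȷ",
}


def letter_set(listing):
    """Split a space-separated listing into a set of NFC-normalized letters."""
    return {unicodedata.normalize("NFC", tok) for tok in listing.split()}


SETS = {lang: letter_set(listing) for lang, listing in LETTERS.items()}


def still_missing(base, target):
    """Letters that `target` needs and `base` does not already provide."""
    return SETS[target] - SETS[base]


for lang in ("French", "German", "Portuguese", "Dutch"):
    missing = still_missing("English", lang)
    print(f"{lang}: {len(missing)} of {len(SETS[lang])} still missing:",
          " ".join(sorted(missing)))
```

The same set operations give intersections for any group of languages, for example `SETS["English"] & SETS["French"] & SETS["Portuguese"]`.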

Am building a database query tool to see what is needed for the different languages.

binarybottle commented 1 year ago

Thank you, @iandoug -- I appreciate your building a query tool to distinguish between the languages!

Diacritics on a dedicated layer makes sense to me to avoid conflicts and clutter.

From the diacriticals you shared, it looks like we would need the following to support every (?) W European language:

2 accents for 2 letters (a/A, e/E)
1 umlaut for 3 letters (e/E, i/I, o/O)
1 circumflex for 2 letters (e/E, o/O)
1 tilde for 1 letter (n/N)
2 pairs of combined letters (æ/Æ, œ/Œ)

If two dedicated keys were used in conjunction with A, E, I, O, N, and two arbitrary letters (for æ and œ), then they could cover all of the above except for e/E umlaut and e/E circumflex.
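
The two-key idea can be tried out with Unicode combining marks. This is a hedged sketch, not an actual Engram mechanism: the `KEY_1` and `KEY_2` assignments are one hypothetical way to split the marks between two dedicated keys, and `compose` is an illustrative helper built on Python's standard `unicodedata` module.

```python
import unicodedata

# Combining marks (Unicode code points) that a dedicated key could apply.
MARKS = {
    "acute": "\u0301",
    "grave": "\u0300",
    "circumflex": "\u0302",
    "diaeresis": "\u0308",
    "tilde": "\u0303",
    "cedilla": "\u0327",
}


def compose(base, mark):
    """Return the precomposed letter for base + mark, or None if Unicode has none."""
    combined = unicodedata.normalize("NFC", base + MARKS[mark])
    return combined if len(combined) == 1 else None


# Hypothetical assignment: each dedicated key maps a base letter to one mark.
KEY_1 = {"a": "acute", "e": "acute", "i": "diaeresis", "o": "diaeresis", "n": "tilde"}
KEY_2 = {"a": "grave", "e": "grave", "o": "circumflex"}

reachable = {compose(base, mark)
             for key in (KEY_1, KEY_2)
             for base, mark in key.items()}
print(" ".join(sorted(reachable)))
```

Under this assignment, e with diaeresis and e with circumflex are not reachable, because E is already used once on each key.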

iandoug commented 1 year ago

Bepo.

Personally, I think it may be too complicated. According to English WP, both this and the revised Azerty have been accepted by French standards authorities, but have had little uptake.

I think manufacturers are reluctant to retool. Or the need for new drivers for Windoze is holding them back.

https://bepo-fr.translate.goog/wiki/Caract%C3%A8res_pris_en_charge?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp&_x_tr_sch=http

linked from https://bepo.fr/wiki/Accueil

then Googlised.

On a related issue, have you got your layouts using the magic diacritic key working in practice?

iandoug commented 1 year ago

The truth is somewhere between what these guys:

http://diacritics.typo.cz/index.php?id=49

and these guys:

https://hyperglot.rosettatype.com/

and the various WP pages (eg, "spanish alphabet") say.

I've posted some corrections to Hyperglot for Afrikaans, so other languages may also have issues.

olddaniel commented 7 months ago

Rather than settling on one international design that's balanced between several languages but hardly excels at any, I'd suggest we shift our focus towards making some kind of software that spits out the optimized layout based on the 1 or 2 languages the user is going to be typing most of the day. For example, I write 80% of my day in English and 20% in Portuguese. If I could enter those 2 languages and their weights into a keyboard analyzer using the Engram math/logic, it would give me the perfect layout for my use case, whilst for another user it could be a different layout at 10% English and 90% Portuguese, etc.
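
The per-user weighting described here amounts to blending frequency tables before running whatever optimizer is used. A minimal sketch, assuming each language already has a table of relative frequencies; `blend_frequencies` is a hypothetical helper and the numbers are toy values, not measured from any corpus.

```python
def blend_frequencies(tables, shares):
    """Mix per-language frequency tables according to how much the user types each."""
    total = sum(shares.values())
    blended = {}
    for lang, share in shares.items():
        for gram, freq in tables[lang].items():
            blended[gram] = blended.get(gram, 0.0) + (share / total) * freq
    return blended


# Toy frequencies for illustration only.
tables = {
    "English": {"e": 0.6, "t": 0.4},
    "Portuguese": {"e": 0.5, "a": 0.3, "ã": 0.2},
}

# 80% English, 20% Portuguese, as in the example above.
profile = blend_frequencies(tables, {"English": 80, "Portuguese": 20})
print(profile)
```

The blended table can then be fed to a layout scorer in place of a single-language table, so a 10% English / 90% Portuguese user would simply pass different shares.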

olddaniel commented 7 months ago

Ok, after much effort I put together an adaptation of the English Engram that has all the diacritical accents needed by Portuguese speakers (while still being optimized for the English language). Here is the keylayout file (for macOS): https://drive.google.com/file/d/1DBLHpBnFlDoDfmZ38-y7qsPiZweY2yFM/view?usp=sharing
