HughP / dnj-corpus

A small corpus of a local newspaper

Character set for round 1 #5

Closed iandoug closed 6 years ago

iandoug commented 6 years ago

Hi

Your "waiting for doctor" texts contain strings of the form de dha dedewo.

Any idea what the broken characters are? I looked via my hex editor which seems to imply that they are hex 1E which is "record separator" ... in which case we should strip them from the text. Just worried that they may be standing in for one of the tone indicators since they are at beginning and end of words... or maybe it's a by-product of however the document was originally typeset.
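One way to check this before deleting anything is to count the control characters and only then strip them. The following is a minimal sketch, not a tested cleanup script: the function names are made up for illustration, and the position of the U+001E characters in the sample string is a guess, since the thread does not show exactly where they sit.

```python
import unicodedata
from collections import Counter

# Ordinary whitespace controls that should never be reported or stripped.
ORDINARY = {"\n", "\r", "\t"}

def control_char_report(text):
    """Count characters in Unicode category Cc (control), minus ordinary whitespace."""
    return Counter(
        ch for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ORDINARY
    )

def strip_chars(text, chars):
    """Return text with every character in `chars` deleted."""
    return text.translate({ord(c): None for c in chars})

# Hypothetical sample: the placement of U+001E here is an assumption.
sample = "\x1ede dha \x1edede\x1ewo"

report = control_char_report(sample)
for ch, n in report.items():
    print(f"U+{ord(ch):04X} occurs {n} time(s)")

cleaned = strip_chars(sample, report.keys())
print(cleaned)
```

Reporting first keeps the question open of whether the characters stand in for a tone indicator. Stripping is then a separate, deliberate step.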

I have removed about 21 instances of the French/Dan beginners' guide text. From my point of view, if we want to optimise for Dan, then having bits of French in the mix is going to give misleading results.

However there are still a few "foreign" words in the mix that contain letters not in Dan, like "j". These are proper names etc.

Regarding the assorted quotes, is the plan to replace all typographical quotes with straight ASCII quotes (both single and double)? And then use French-style chevron quotes for speech?
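If that is the plan, the first half could be sketched as a simple translation table. This is only an illustration under that assumption: the mapping covers the four common curly quotes and nothing else, and `straighten_quotes` is a hypothetical name. The chevron quotes are left alone so they stay available for speech.

```python
# Map typographic single and double quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (also used as an apostrophe)
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})

def straighten_quotes(text):
    """Replace curly quotes with straight ASCII quotes; chevrons are untouched."""
    return text.translate(QUOTE_MAP)

print(straighten_quotes("\u201cda\u2019\u201d \u00abx\u00bb"))
```

Because U+2019 doubles as an apostrophe in many texts, this step would also flatten any typographic apostrophes, which matters if the quote characters end up carrying tone.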

Does Dan use apostrophes?

Thanks, Ian

iandoug commented 6 years ago

Re ϋ U+03CB: GREEK SMALL LETTER UPSILON WITH DIALYTIKA

There are a lot of cases where this character was produced not with the Unicode character above but by combining upsilon with a dialytika after the fact. I assume this is because of how it is entered on the keyboard. Let's see if an enlarged copy-paste will help:

ʋ̈ ϋ

The first char above is as-is in the text (badly rendered in my browser), the second is U+03CB inserted via a character-picker tool.

So KLA is not going to handle the two-step method, only the plain U+03CB one-key (or AltGr-u / whatever) method.

BTW I assume it is upsilon character? Please advise if not :-)
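The question of which characters are actually in the text can be settled by listing code points rather than trusting the rendering. A minimal sketch, assuming the two-step form really is upsilon plus a combining diaeresis as guessed above (the `describe` helper is hypothetical):

```python
import unicodedata

def describe(s):
    """List each code point in s with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]

two_step = "\u03c5\u0308"  # assumed: upsilon followed by COMBINING DIAERESIS
one_key = "\u03cb"         # precomposed GREEK SMALL LETTER UPSILON WITH DIALYTIKA

print(describe(two_step))
print(describe(one_key))

# If the base letter really is Greek upsilon, NFC normalization merges
# the two-step sequence into the single precomposed character.
composed = unicodedata.normalize("NFC", two_step)
print(composed == one_key)
```

Running `describe` on a string pasted from the corpus would show whether the base letter is upsilon or something that merely looks like it.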

iandoug commented 6 years ago

@HughP "Yes they need to go to commas, however in the Perl regex that I’ve been using it somehow destroys the file."

At present I'm just playing in the Kate editor, using search-replace. Let me try those.

Okay, that worked. Here is where I am at present; please take a look (as opposed to officially adding to the repo via Git ... maybe I'm working in a different direction to you).

proof-of-concept-text-sans-fr-tone-fixed-01.txt

iandoug commented 6 years ago

"Download the readme.md and look in the html comments."

Github really needs a "quote text in reply" function ...

Okay had a look at the comments in the readme.

Which leads to further issues... firstly, this font that Github uses is not helping.

font-family (stack): -apple-system,BlinkMacSystemFont,"Segoe UI",Helvetica,Arial,sans-serif,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol"

Font being rendered: Helvetica

So here's a screenshot of some letters, using Symbola, which does a much better job. (click for cinemascope) problem-chars-1

Issues:

  1. is Ëë accidentally duplicated? (in your alphabet list)

  2. Ii Ɩɩ are unnecessarily similar, especially at small sizes, and will likely be problematic when hand-written. I accept that at this point this may not be changeable, but since Dan does not use Jj, they may be a better choice and the lower case at least is more distinguishable. Also compare vs lower case t: ɩt

  3. as mentioned previously, Uu Üü Ϋϋ Ʋʋ Yy are going to give trouble. If I read Unicode correctly, then Ʋʋ are sounds rather than letters. Also, the <ɩ, ʋ ʋ̈> section is lifted from your readme, and I notice that it has ʋ̈, while that letter/sound does not appear in your alphabet. This kinda proves my point, that it is going to cause confusion, either with u or Y. So possibly Ʋʋ (the sounds) do not belong in the alphabet, and ʋ ʋ̈ should not be there either? They are also unnecessarily similar to plain uü... maybe something like Ūū could be used instead?

Not sure how much flexibility we have at this point ... at least the written corpus is way smaller than English, and this early on the written language is still clearly evolving, so whatever goes into the Bible will set the standard for years to come....

iandoug commented 6 years ago

Okay, re

Ϋϋ Ʋʋ

I decided to have a go at making a keyboard, using Matthew chps 5 to 7 as input, after cleaning up a bit. I chose this because:

  1. single source
  2. presumably done by people who know what they are doing
  3. I picked those 3 chapters because they should minimise references to Jesus and God (to avoid upsetting the letter distribution), and they won't have all those "begats" or "was the son of" (for the same reason).
  4. verse numbers add some numerals to the mix.
  5. smaller size is better for KLA.

Along the way discovered that they use ʋ U+028B LATIN SMALL LETTER V WITH HOOK, with U+0308 COMBINING DIAERESIS. I replaced all of those with ϋ U+03CB GREEK SMALL LETTER UPSILON WITH DIALYTIKA, and I'm starting to think that different people are picking one or the other to use, and that the language does not have both. So I'm going with 3CB because that is a Unicode character while the other is two characters merged by software.
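The replacement described above has to be an explicit substitution, because Unicode normalization will not do it: U+028B with U+0308 has no precomposed form, and U+03CB is a different base letter. A minimal sketch of that step (the constant and function names are illustrative, not from the thread):

```python
import unicodedata

V_HOOK_DIAERESIS = "\u028b\u0308"  # LATIN SMALL LETTER V WITH HOOK + COMBINING DIAERESIS
UPSILON_DIALYTIKA = "\u03cb"       # GREEK SMALL LETTER UPSILON WITH DIALYTIKA

def unify_upsilon(text):
    """Replace every v-with-hook + diaeresis pair with the single U+03CB."""
    return text.replace(V_HOOK_DIAERESIS, UPSILON_DIALYTIKA)

# NFC leaves the two-code-point sequence untouched, so it cannot replace this step.
print(unicodedata.normalize("NFC", V_HOOK_DIAERESIS) == V_HOOK_DIAERESIS)

# A plain U+028B without the diaeresis is deliberately left alone.
print(unify_upsilon("a\u028b\u0308b \u028b") == "a\u03cbb \u028b")
```

Bare ʋ is kept as-is here, which matches the character set and counts further down, where both ʋ and ϋ still appear.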

My character set that satisfies Matt 5 - 7, plus standard ANSI characters, plus French quotes, now looks like this, as a PHP variable:

$chars="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890`~!@#$%^&*()--_=+|]}[{" . '"' . "';:/?.>,< ëɛƆÜËƐöɔɩ꞊‹›«»°üʋÖϋ";

Note it has the proper short = as well as the normal Maths =. I note that not all the capital letters from Hugh's alphabet are included above, I'll add them in as needed.
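To find out which extra letters need adding, the text can be checked against the set. The sketch below transcribes the character set from the PHP variable above into Python. The `outside_charset` helper is hypothetical, and treating line breaks as allowed is an assumption.

```python
from collections import Counter

# Character set transcribed from the PHP $chars variable above.
CHARS = (
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    "`~!@#$%^&*()--_=+|]}[{"
    '"'
    "';:/?.>,< ëɛƆÜËƐöɔɩ꞊‹›«»°üʋÖϋ"
)

def outside_charset(text, allowed=CHARS):
    """Count characters in text that are not in the allowed set (line breaks ignored)."""
    allowed_set = set(allowed) | {"\n", "\r"}
    return Counter(ch for ch in text if ch not in allowed_set)

# Hypothetical sample containing a stray U+001E and a capital Ɩ (U+0196).
for ch, n in outside_charset("abc\x1e \u0196").most_common():
    print(f"U+{ord(ch):04X}: {n}")
```

Anything this reports is either a letter still to be added to the set or debris to clean out of the input.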

Have attached the input text. Might be copyright infringement. Or fair use. :-)

They also used a soft non-breaking hyphen or somesuch for the tone marker; I switched all of those to the normal ASCII hyphen-minus. The text uses the proper ꞊ for tone. Is the jury still out on whether to use this, or simply use the normal Maths = ?

While on the topic, any thoughts on copying the Nigerian methodology for tone? That requires the diacritic apostrophe as an extra key, but then we don't need ꞊ as an extra key, and it leaves the normal apostrophe (and double apostrophe) open for use as in English.

Char counts are:

(space) : 4093
(dash) : 1760 (markdown turns dash into bullet...)
a : 1546
' : 1019
h : 989
n : 742
d : 732
ɛ : 718
ö : 661
k : 650
b : 538
ë : 449
ɔ : 406
g : 395
o : 356
y : 350
꞊ : 346
i : 330
ü : 302
w : 301
" : 273
u : 257
m : 252
p : 249
s : 198
t : 193
, : 191
z : 127
. : 106
e : 84
l : 81
ϋ : 68
A : 61
Y : 49
ʋ : 47
K : 45
2 : 42
1 : 42
: : 41
ɩ : 34
M : 32
D : 29
3 : 27
» : 23
« : 23
4 : 21
! : 16
r : 15
f : 14
5 : 12
7 : 12
6 : 12
? : 12
8 : 11
9 : 10
0 : 9
W : 8
) : 6
; : 6
S : 6
( : 6
› : 6
‹ : 6
B : 5
Z : 5
I : 3
Ɛ : 3
N : 3
P : 2
F : 2
Ö : 1
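For anyone wanting to reproduce counts like these on another text, a tally of this kind can be sketched in a few lines. This is not the tool that produced the numbers above: the `(space)` and `(dash)` labels simply mimic that listing, and skipping line breaks is an assumption.

```python
from collections import Counter

# Readable labels for characters that are hard to see in a listing.
LABELS = {" ": "(space)", "-": "(dash)"}

def char_counts(text):
    """Return (label, count) pairs sorted by descending count, ignoring line breaks."""
    counts = Counter(ch for ch in text if ch not in "\r\n")
    return [(LABELS.get(ch, ch), n) for ch, n in counts.most_common()]

# Tiny made-up sample; a real run would read the corpus file as UTF-8 instead.
for label, n in char_counts("a-a a\n"):
    print(f"{label} : {n}")
```

`Counter` counts code points, so a letter typed as base plus combining mark is tallied as two entries. That is another reason to unify the ϋ variants before counting.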

matthew-5-6-7.txt