HughP / dnj-corpus

A small corpus of a local newspaper

issues with cleanup #30

Closed iandoug closed 6 years ago

iandoug commented 6 years ago

Just putting this stuff in separate thread, #9 was getting off topic.

Your cleanup script uses the same instruction for spaces around the period and around the comma; I think you forgot to change period to comma. I'm not quite sure I follow what you are trying to do there, so I will leave that point for now. (My reading of the regex may be wrong, but at the moment it seems to differ from the text description of what you want to do.)

HughP commented 6 years ago

Thanks for that catch. Yes, that was a mistake. The correct syntax is 's/\s[,](?=\s)/\s\N{U+002C}/g'. The README file already had the correct syntax; I just corrected the script and pushed again.
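For anyone following along without Perl, here is a minimal Python sketch of the match side of that substitution: whitespace, a comma, then a lookahead for more whitespace. What the match should be replaced with is not settled in this thread, so `tighten_commas` (a hypothetical helper name) simply drops the whitespace before the comma. That behaviour is an assumption, not the script's confirmed intent.

```python
import re

# Same match as the Perl pattern: one whitespace character, a comma,
# then a lookahead requiring whitespace after the comma.
SPACE_COMMA = re.compile(r"\s,(?=\s)")


def tighten_commas(line: str) -> str:
    """Drop the whitespace before a comma that is itself followed by whitespace.

    Assumption: removing the space is the desired cleanup; the thread
    does not confirm what the replacement should be.
    """
    return SPACE_COMMA.sub(",", line)
```

A comma that is not followed by whitespace (as in "a ,b") is left alone, because the lookahead fails.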

iandoug commented 6 years ago

Would the accent on the final e in në˗nu bha ꞊në ꞊misié be a typo?

HughP commented 6 years ago

Yes, it should be ë. I think there are something like six cases of this sort of thing.

On Wed, Jun 20, 2018 at 12:32 PM, Ian Douglas notifications@github.com wrote:

Would the accent on the final e in në˗nu bha ꞊në ꞊misié be a typo?


HughP commented 6 years ago

This is at the bottom of the README. Basically, I am not sure where the COMBINING MACRON and DIAERESIS cases are located in the corpus. The other cases are likely misspellings once the French is removed.

U+FFF9          17    INTERLINEAR ANNOTATION ANCHOR
U+0304          1     COMBINING MACRON
U+2013    –     1064  EN DASH
U+00E7    ç     21    LATIN SMALL LETTER C WITH CEDILLA
U+00E8    è     221   LATIN SMALL LETTER E WITH GRAVE
    One or two non-French cases of mistyping
U+00E9    é     107   LATIN SMALL LETTER E WITH ACUTE
U+00EA    ê     28    LATIN SMALL LETTER E WITH CIRCUMFLEX
    ʼö ya ˗a ˗ga ˗sê --> e+diaeresis; others are French
U+00EE    î     3     LATIN SMALL LETTER I WITH CIRCUMFLEX
U+00FB    û     26    LATIN SMALL LETTER U WITH CIRCUMFLEX
U+00A8    ¨     1     DIAERESIS

iandoug commented 6 years ago

Yes. should be ë I think there are something like six cases of this sort of thing.

After I removed the French/Dan lessons, I am left with:
꞊misié
–Dukwitaa (Duekoué) ˮpɛɛn (which I guess may be a name)
Perɛzë (Bakué mi) (likewise)

Current bottom of my character frequency table is

J | U+004A | 9 | Latin Capital Letter J |
R | U+0052 | 8 | Latin Capital Letter R |
… | U+2026 | 7 | Horizontal Ellipsis |
ê | U+00EA | 7 | Latin Small Letter E With Circumflex | Latin Small Letter E Circumflex
û | U+00FB | 5 | Latin Small Letter U With Circumflex | Latin Small Letter U Circumflex
c | U+0063 | 4 | Latin Small Letter C |
î | U+00EE | 3 | Latin Small Letter I With Circumflex | Latin Small Letter I Circumflex
é | U+00E9 | 3 | Latin Small Letter E With Acute | Latin Small Letter E Acute
\ | U+005C | 1 | Reverse Solidus | Backslash
¨ | U+00A8 | 1 | Diaeresis | Spacing Diaeresis
 ̄ | U+0304 | 1 | Combining Macron | Non-spacing Macron
x | U+0078 | 1 | Latin Small Letter X |
° | U+00B0 | 1 | Degree Sign |

The three é are accounted for, which leaves just one combining macron and one diaeresis to track down. The x is in "4 x 4" ... can't avoid that.
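A table like the one above can be rebuilt with a few lines of Python. This is only a sketch, not the script that produced the table: `char_frequency_rows` and `format_rows` are hypothetical names, whitespace is skipped by assumption, and `unicodedata.name` returns upper-case names rather than the title-case ones shown in the table.

```python
import unicodedata
from collections import Counter


def char_frequency_rows(text: str):
    """Return (char, code point, count, name) rows, most frequent first."""
    # Assumption: whitespace is not interesting for the frequency table.
    counts = Counter(ch for ch in text if not ch.isspace())
    rows = []
    # Sort by descending count, then by code point to keep ties stable.
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", n, name))
    return rows


def format_rows(rows) -> str:
    """Render rows in the pipe-separated layout used in this thread."""
    return "\n".join(f"{ch} | {cp} | {n} | {name}" for ch, cp, n, name in rows)
```

The rare characters then sit at the bottom of the output, which is where the stray macron and diaeresis showed up.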

î be here: Benoît. ˗Aga ˗dhɛkpaɔyi benoît).» ʼYö misiee Benoît –dhɛ ʼBɛnö Oitö. Guess middle one should also be capitalised as name.

û be here: ʼDannëngdhö ˮkwi –bha ˮsu ʼö –kë Août ʼka, ˗a ʼgü. –A zü –yö dhɛ ʼö ˗kë ʼUtö ( Août ) ʼka ʼö dɔ Zue ʼö –kë Août ʼka, ˗a ʼgü. –A which I guess could all be names...

I have provided for both é and î and û on current development keyboard layout.

iandoug commented 6 years ago

Also this still:

̈ | U+0308 | 4444 | Combining Diaeresis | Non-spacing Diaeresis

4444 is a lot. And that's after fixing the hook-v and upsilons.

iandoug commented 6 years ago

Think you need your variant of this too:

// fix en-dash U+2013 to MODIFIER LETTER MINUS SIGN U+02D7
$line = preg_replace('/\x{2013}/u',"˗",$line);

I see it in the comment for 10a but looks like you only switched normal hyphen-minus to letter-minus.
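The same en-dash fix as a Python sketch, in case it helps to compare against the Perl side. `fix_en_dash` is a hypothetical name, and the blanket replacement assumes that no genuine en dash (for example in a number range) should survive in the Dan text.

```python
EN_DASH = "\u2013"
MODIFIER_LETTER_MINUS = "\u02d7"  # the tone mark the corpus uses


def fix_en_dash(line: str) -> str:
    """Replace every EN DASH with MODIFIER LETTER MINUS SIGN.

    Assumption: every en dash in the cleaned text is a mistyped tone mark.
    """
    return line.replace(EN_DASH, MODIFIER_LETTER_MINUS)
```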

iandoug commented 6 years ago

basically I am not sure where the, COMBINING MACRON, or the DIAERESIS, cases are located in the corpus.

Combining macron is on the b in ˗b̄ha here: ʼmɛ ʼö ˗a ga ˗yö ʼko ˮdhiʋ̈ ˗dhɛ ˗nu ꞊wa ʼto ˗nɛɛsü, ꞊ya nëng kë ˗a ˮyi mü ˗sü ˗b̄ha, ˗yö ˗kë ˮklʋ̈ʋ̈klʋ̈.

Diaeresis is on second a in waa¨ here: waa¨ʼwëë˗ ˮgblü ˮsɔɔdo

I guess the b-macron is some sort of typo? Maybe someone trying to do a ƃ U+0183 LATIN SMALL LETTER B WITH TOPBAR as overkill / between two worlds (ƃ being bh).
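For tracking down one-off characters like these, a small context search is usually enough. The sketch below is an assumption about how one might do it, not the method actually used here; `find_char` is a hypothetical helper that reports the line number, the offset and a few characters on either side of each hit.

```python
def find_char(lines, target: str, context: int = 10):
    """Return (line number, offset, snippet) for every occurrence of target."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        pos = line.find(target)
        while pos != -1:
            start = max(0, pos - context)
            snippet = line[start:pos + len(target) + context]
            hits.append((lineno, pos, snippet))
            pos = line.find(target, pos + 1)
    return hits
```

Calling it with `"\u0304"` (COMBINING MACRON) or `"\u00a8"` (DIAERESIS) on the corpus lines gives the surrounding text to search for in an editor.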

iandoug commented 6 years ago

After cleanup, the following lines with c, j or q remain. All foreign. Also lots of Abidjan. I just find it weird that Dan has no J or C, but their language code is DNJ, their currency is CFA and their country is Côte d'Ivoire ... Note the typo in the currency below. I made provision for the (French) Franc sign on the keyboard; I don't know if they ever use it.

(ʼMilan Ac) ˗kë ʼgü ʼwo˗ ˗dhɛ UNICEF Mɛsë ˗dede ʼö ꞊slɔɔ ʼSedeao (Cedeao kwɛɛ (2007) (Cedeao), kö ˗wa ˗pö ʼka ˗a ˗naa ʼwëë˗ ˗kɔ ʼö ʼwo˗dhɛ BCEAO bha, ʼyö ˗kë ˗a gɔ ˗mɛ ˗nu ʼgü ˗së kö ˮwëë˗ bha˗ ˗dhɛ FESPACO. ꞊Ya kë OMS, ʼö˗ do ʼbha ʼwo˗ ˗dhɛ UNICEF ( ONUDC ), ˗a ˗nu ˗bha didhɛtëë ʼka, ʼyö (ANARIZ˗CI). ʼWɔn ʼö ˗gban ˗sü ˗bha, ꞊në ʼö ANARIZ˗CI bha yë ʼka. ˗A ʼö ʼwo˗ suu ˗dhɛ NERICA ( 110.000.000.000 cfa ) 4 x 4, ˗a ˗nu ˮsaɔplɛ (7), ꞊poponi ˗ (ONUCI) ˗yö ö bha pë ˗dhɛ ONUCI˗ ˗bha ꞊dɛɛ ˗yënng ya˗ ʼgü 1( Chiondes I ) ˗nu ʼwo˗ ˗nu ONISI (ONUCI) ˗nu › ˮKwigɔn Banggbö ʼprɛnngsü ( Prince) ʼwo˗ ˗dhɛ INISƐFË (UNICef) in frence». ˮNʋnʋ bha, ˗yö ꞊bɛdhɛ ˮdhɔɔ ˗dɔ ʼg
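A filter for such lines can be sketched in Python as below. `lines_with_foreign_letters` is a hypothetical name, and treating any c, j or q (in either case) as a sign of foreign material is only the heuristic described above, not a rule of Dan orthography that the code verifies.

```python
import re

# Letters that, per the observation above, only show up in foreign words.
FOREIGN_LETTERS = re.compile(r"[cjq]", re.IGNORECASE)


def lines_with_foreign_letters(lines):
    """Keep only the lines that contain c, j or q in either case."""
    return [line for line in lines if FOREIGN_LETTERS.search(line)]
```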

iandoug commented 6 years ago

Of the 4444 combining diaeresis:
1757 are with the lower hook v
795 with e
1230 with o
638 with u
3 with O
12 with U

Mmm that's not quite 4444 but after running fixups I don't have any more combining diaeresis left.
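The listed counts add up to 4435, so 9 are unaccounted for. One way to see what else carries a U+0308 is to tally whatever character precedes each one. This is a sketch under that assumption; `diaeresis_bases` is a hypothetical name, and a mark at position 0 is reported under the made-up label `<line start>`.

```python
from collections import Counter

COMBINING_DIAERESIS = "\u0308"


def diaeresis_bases(lines) -> Counter:
    """Count the characters that immediately precede each U+0308."""
    bases = Counter()
    for line in lines:
        for i, ch in enumerate(line):
            if ch == COMBINING_DIAERESIS:
                # A mark at the very start of a line has no base letter.
                bases[line[i - 1] if i > 0 else "<line start>"] += 1
    return bases
```

Any key other than ʋ, e, o, u, O or U in the result points at a combination the fixups do not cover yet.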

// clean up other funnies
$line = preg_replace("/b̄h/u",'bh',$line); # has macron on b...
$line = preg_replace("/a¨/u",'ä',$line);
$line = preg_replace('/\x{2026}/u','...',$line); # horizontal ellipsis -> ...
$line = preg_replace('/\x{FFF9}/u','',$line); # remove Interlinear Annotation Anchor
$line = preg_replace('/\x{000C}/u','',$line); # remove Form Feeds ... should not be here at this point.

// Get rid of combining diacritics
$line = preg_replace('/\x{0065}\x{0308}/u','ë',$line); #795 with e
$line = preg_replace('/\x{006f}\x{0308}/u','ö',$line); #1230 with o
$line = preg_replace('/\x{0075}\x{0308}/u','ü',$line); #638 with u
$line = preg_replace('/\x{004f}\x{0308}/u','Ö',$line); #3 with O
$line = preg_replace('/\x{0055}\x{0308}/u','Ü',$line); #12 with U

HughP commented 6 years ago

@iandoug The following could be simplified in PHP by using the Normalizer class: http://php.net/manual/en/class.normalizer.php

// Get rid of combining diacritics

$line = preg_replace('/\x{006f}\x{0308}/u','ö',$line); #1230 with o
$line = preg_replace('/\x{0075}\x{0308}/u','ü',$line); #638 with u
$line = preg_replace('/\x{004f}\x{0308}/u','Ö',$line); #3 with O
$line = preg_replace('/\x{0055}\x{0308}/u','Ü',$line); #12 with U

http://php.net/manual/en/class.normalizer.php
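To show what normalization buys, here is the same idea as a Python sketch using the standard `unicodedata` module instead of PHP's Normalizer (a swap made only for illustration; `compose` is a hypothetical name). NFC folds each base letter plus U+0308 into its precomposed form where Unicode has one. As far as I know there is no precomposed ʋ with diaeresis, so that pair stays as two code points.

```python
import unicodedata


def compose(line: str) -> str:
    """Apply Unicode NFC so base + combining mark pairs become single code points."""
    return unicodedata.normalize("NFC", line)
```

This would replace the whole block of per-letter substitutions, but it does not touch the spacing DIAERESIS (U+00A8) or remove a stray macron, so those fixes are still needed separately.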

HughP commented 6 years ago

Combining macron is on the b in ˗b̄ha here: ʼmɛ ʼö ˗a ga ˗yö ʼko ˮdhiʋ̈ ˗dhɛ ˗nu ꞊wa ʼto ˗nɛɛsü, ꞊ya nëng kë ˗a ˮyi mü ˗sü ˗b̄ha, ˗yö ˗kë ˮklʋ̈ʋ̈klʋ̈.

Diaeresis is on second a in waa¨ here: waa¨ʼwëë˗ ˮgblü ˮsɔɔdo

I guess the b-macron is some sort of typo? Maybe someone trying to do a ƃ U+0183 LATIN SMALL LETTER B WITH TOPBAR as overkill / between two worlds (ƃ being bh).

Yes, just kill the macron then as a typo. There is a look-alike character like this in Liberia for capital B with hook, but bh is already b with hook.

HughP commented 6 years ago

We don't have an a with diaeresis, so this is also likely a typo for a tone mark, but which one? Just zap it.

HughP commented 6 years ago

î be here: Benoît. ˗Aga ˗dhɛkpaɔyi benoît).» ʼYö misiee Benoît –dhɛ ʼBɛnö Oitö. Guess middle one should also be capitalised as name.

û be here: ʼDannëngdhö ˮkwi –bha ˮsu ʼö –kë Août ʼka, ˗a ʼgü. –A zü –yö dhɛ ʼö ˗kë ʼUtö ( Août ) ʼka ʼö dɔ Zue ʼö –kë Août ʼka, ˗a ʼgü. –A

My take on the things inside parens () is that they are French words or abbreviations. In West Africa, names are often really long "to be formal", so the abbreviation becomes the real name... that is an interesting dynamic for these small languages.

iandoug commented 6 years ago

Mmm that's not quite 4444 but after running fixups I don't have any more combining diaeresis left.

Incorrect.... there were still all those on hook v plus a few more. Tracked them down to this:

  1. several at start of line, not on a letter, so don't show up in text.
  2. one double on the middle o here: (that's a combining diaeresis on top of ö) ˗Bhöpë ˗nu ʼö̈ kwa ˗bhawɔn

so fix:
$line = preg_replace('/^\x{0308}/u','',$line); // some on no letter, at start of line
$line = preg_replace('/\x{00f6}\x{0308}/u','ö',$line); // 1 with ö

After which am left with 1757, same as number of hook v with diaeresis.
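The two fixes above, sketched in Python for comparison. `drop_stray_diaeresis` is a hypothetical name. Unlike the PHP line, the leading pattern here removes a run of several marks at the start of a line; that is an assumption about what is wanted.

```python
import re

# One or more combining diaereses with no base letter, at the start of a line.
LEADING_DIAERESIS = re.compile("^\u0308+")
# A precomposed o-diaeresis that carries a second, combining diaeresis.
DOUBLED_DIAERESIS = re.compile("\u00f6\u0308")


def drop_stray_diaeresis(line: str) -> str:
    """Remove baseless leading U+0308 marks and the doubled mark on o-diaeresis."""
    line = LEADING_DIAERESIS.sub("", line)
    return DOUBLED_DIAERESIS.sub("\u00f6", line)
```

The diaeresis on hook v is deliberately left alone, which matches the 1757 that remain.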

I think the issue with (1) above is (partly) what caused the "kill blank lines" procedures to fail. I say partly because I still see the occasional double blank line, there must be some other invisible character hiding there. Will see if some code can expose them like (2) above.

So at moment cleanup looks okay, just need to sort out issues re spaces around dashes and comma and period. Also want to look at reflowing some of those numerous short lines. Problem is telling the difference between a paragraph or line broken into one-or-two word lines, and section headings.

iandoug commented 6 years ago

I think the issue with (1) above is (partly) what caused the "kill blank lines" procedures to fail. I say partly because I still see the occasional double blank line, there must be some other invisible character hiding there. Will see if some code can expose them like (2) above.

Na, looks like it was just the hidden combining diaeresis. Once I remove them, then cat -s seems to get rid of the duplicate blank lines fine. Must be subtle bug (probably related to French stripping) that is not doing it 100% in the PHP.
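A rough Python stand-in for the remove-then-squeeze step, in case someone wants it in one pass. `squeeze_blank_lines` is a hypothetical name and lines are assumed to come without their trailing newline. Note one difference: `cat -s` squeezes only truly empty lines, while this sketch also treats whitespace-only lines as blank.

```python
COMBINING_DIAERESIS = "\u0308"


def squeeze_blank_lines(lines):
    """Strip leading stray U+0308 marks, then collapse runs of blank lines to one."""
    out = []
    previous_blank = False
    for line in lines:
        # A line holding only a combining mark looks empty but is not.
        cleaned = line.lstrip(COMBINING_DIAERESIS)
        blank = cleaned.strip() == ""
        if blank and previous_blank:
            continue  # drop the repeated blank line
        out.append("" if blank else cleaned)
        previous_blank = blank
    return out
```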

Wonder what traffic French stripping will send from Google...