HughP / dnj-corpus

A small corpus of a local newspaper

What to do with non-visible characters #6

Closed iandoug closed 6 years ago

iandoug commented 6 years ago

Re Point 12 in Readme as of Sat 9 June GMT

U+0009 482 
U+000A 30690 
U+000C 220 
U+000D 1340 
U+001E 5442 
U+0020 124711

Suggestions:
U+0009 == Tab: ignore or replace with 4 spaces. We can accept keyboard will have tab key in fixed location.
U+000A 30690 == line feed: delete
U+000C 220 == form feed: delete
U+000D 1340 == enter/return: leave. KLA uses this as enter/return.
U+001E 5442 == record separator: delete
U+0020 124711 == space: leave
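A minimal sketch of how the suggestions above could be applied, assuming Python and a corpus already loaded as a string. The `CLEANUP` table and the `clean` function are hypothetical names used only for illustration; the corpus itself is processed by a bash script. Note that the line-feed rule is revisited later in this thread, so the table is meant to be edited rather than taken as final.

```python
# Hypothetical cleanup table mirroring the suggestions as first stated.
# Keys are code points; None deletes the character, a string replaces it.
CLEANUP = {
    0x0009: "    ",  # tab: replace with 4 spaces
    0x000A: None,    # line feed: delete (reconsidered later in the thread)
    0x000C: None,    # form feed: delete
    0x001E: None,    # record separator: delete
    # U+000D (carriage return) and U+0020 (space) are not listed,
    # so str.translate leaves them untouched.
}


def clean(text: str) -> str:
    """Apply the cleanup table to every character in text."""
    return text.translate(CLEANUP)


if __name__ == "__main__":
    sample = "a\tb\x0c\x1ec d\r\n"
    print(repr(clean(sample)))
```

Keeping the rules in one dictionary means a single entry can be changed (for example, dropping the line-feed rule) without touching the rest of the code.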

I'm also picking up these other "space" characters: (counts are from my current version of your mass-text file after assorted edits) : 3269 : 59 : 17 : 1 I think these are assorted "non-breaking-spaces" or similar, courtesy of the word processing software used, Unicode has a whack:

U+2000 EN QUAD
U+200A HAIR SPACE
U+00A0 NO-BREAK SPACE
U+200B ZERO WIDTH SPACE
U+2060 WORD JOINER
U+3000 IDEOGRAPHIC SPACE
U+FEFF ZERO WIDTH NO-BREAK SPACE

Just trying to persuade PHP to show me the U+... form, think I need to switch to PHP 7 to get that functionality. Or try and find them in the hex editor.

HughP commented 6 years ago

I'm also picking up these other "space" characters: (counts are from my current version of your mass-text file after assorted edits) : 3269 : 59 : 17 : 1 I think these are assorted "non-breaking-spaces" or similar, courtesy of the word processing software used, Unicode has a whack:

U+2000 EN QUAD
U+200A HAIR SPACE
U+00A0 NO-BREAK SPACE
U+200B ZERO WIDTH SPACE
U+2060 WORD JOINER
U+3000 IDEOGRAPHIC SPACE
U+FEFF ZERO WIDTH NO-BREAK SPACE

Just trying to persuade PHP to show me the U+... form, think I need to switch to PHP 7 to get that functionality. Or try and find them in the hex editor.

@iandoug Yes, some of those characters need to move to symbols. I handled those in the script version of the corpus. I don't think you have some of these code points in the corpus, for instance do you really have a U+2060?

HughP commented 6 years ago

@iandoug did you ever figure something out for php? There seem to be some ideas here: https://stackoverflow.com/a/35213288

Fixed these issues in the latest release of the generate-corpus.bash script.

iandoug commented 6 years ago

Yes, somewhere along the way I stumbled across the json_encode trick, but it's not a complete solution since that function leaves normal low-ASCII chars as-is. So wrote some code that handled both cases. Used it for checking the fonts for required characters. Don't want to move to PHP 7 because I have some old systems that will break in PHP 7. Needs a major rewrite... :-(
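For comparison, a small sketch of the same limitation in Python rather than PHP. Python's `json.dumps` (with its default `ensure_ascii=True`) also escapes only non-ASCII characters and leaves plain ASCII as-is, so a separate formatter is needed to get a uniform U+... form. The helper name `u_plus` is hypothetical and is not the code mentioned above.

```python
import json


def u_plus(ch: str) -> str:
    """Return the U+XXXX form of one character (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    # JSON escaping rewrites the non-ASCII letter but leaves "a" alone.
    print(json.dumps("a\u00eb"))

    # Formatting the code point directly treats both cases the same way.
    print(u_plus("a"))        # U+0061
    print(u_plus("\u00eb"))   # U+00EB
    print(u_plus("\u02bc"))   # U+02BC
```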

iandoug commented 6 years ago

I know it's closed, but that list of "space characters" with Unicode code points was not from the file; it was just suggestions for what the unknown characters could be. I currently have these undesirable characters in the file, after using my "strip French" program (which, BTW, I think misses a few French words ... will revert on that).

: U+000A : 13342 (line feed .. should remove)
: U+0308 : 3269 (combining diaeresis)...
: U+000D : 897 (carriage return, Okay).
: U+000C : 110 (form feed, should remove)
: U+FFF9 : 17 (interlinear annotation anchor ... should probably remove)
‚ : U+201A : 7 (single low quotation mark... think you removed those)
… : U+2026 : 7 (ellipsis)
: U+0009 : 1 (tab)
¨ : U+00A8 : 1 (diaeresis)
: U+0304 : 1 (combining macron)

HughP commented 6 years ago

Here are the characters in the corpus as I still have them:

Big things left are converting hyphen-minus to letter minus and making sure there are no double carriage returns. Then there are a couple of one-offs that I need to hunt down and fix.

code point  glyph  count   character name
U+0009             240     CHARACTER TABULATION
U+000D             16763   CARRIAGE RETURN
U+0020             78603   SPACE
U+0021      !      68      EXCLAMATION MARK
U+0028      (      482     LEFT PARENTHESIS
U+0029      )      483     RIGHT PARENTHESIS
U+002A      *      20      ASTERISK
U+002C      ,      4754    COMMA
U+002D      -      27453   HYPHEN-MINUS
U+002E      .      4170    FULL STOP
U+002F      /      17      SOLIDUS
U+0030      0      867     DIGIT ZERO
U+0031      1      301     DIGIT ONE
U+0032      2      436     DIGIT TWO
U+0033      3      136     DIGIT THREE
U+0034      4      110     DIGIT FOUR
U+0035      5      181     DIGIT FIVE
U+0036      6      81      DIGIT SIX
U+0037      7      160     DIGIT SEVEN
U+0038      8      268     DIGIT EIGHT
U+0039      9      116     DIGIT NINE
U+003A      :      488     COLON
U+003B      ;      79      SEMICOLON
U+003F      ?      201     QUESTION MARK
U+0041      A      1044    LATIN CAPITAL LETTER A
U+0042      B      421     LATIN CAPITAL LETTER B
U+0043      C      15      LATIN CAPITAL LETTER C
U+0044      D      767     LATIN CAPITAL LETTER D
U+0045      E      108     LATIN CAPITAL LETTER E
U+0046      F      97      LATIN CAPITAL LETTER F
U+0047      G      448     LATIN CAPITAL LETTER G
U+0048      H      26      LATIN CAPITAL LETTER H
U+0049      I      66      LATIN CAPITAL LETTER I
U+004A      J      9       LATIN CAPITAL LETTER J
U+004B      K      1223    LATIN CAPITAL LETTER K
U+004C      L      145     LATIN CAPITAL LETTER L
U+004D      M      668     LATIN CAPITAL LETTER M
U+004E      N      353     LATIN CAPITAL LETTER N
U+004F      O      50      LATIN CAPITAL LETTER O
U+0050      P      301     LATIN CAPITAL LETTER P
U+0052      R      8       LATIN CAPITAL LETTER R
U+0053      S      479     LATIN CAPITAL LETTER S
U+0054      T      274     LATIN CAPITAL LETTER T
U+0055      U      50      LATIN CAPITAL LETTER U
U+0056      V      121     LATIN CAPITAL LETTER V
U+0057      W      508     LATIN CAPITAL LETTER W
U+0059      Y      976     LATIN CAPITAL LETTER Y
U+005A      Z      385     LATIN CAPITAL LETTER Z
U+005B      [      10      LEFT SQUARE BRACKET
U+005C      \      1       REVERSE SOLIDUS
U+005D      ]      10      RIGHT SQUARE BRACKET
U+0061      a      29819   LATIN SMALL LETTER A
U+0062      b      9792    LATIN SMALL LETTER B
U+0063      c      436     LATIN SMALL LETTER C
U+0064      d      12033   LATIN SMALL LETTER D
U+0065      e      5895    LATIN SMALL LETTER E
U+0066      f      429     LATIN SMALL LETTER F
U+0067      g      10265   LATIN SMALL LETTER G
U+0068      h      15281   LATIN SMALL LETTER H
U+0069      i      8555    LATIN SMALL LETTER I
U+006A      j      71      LATIN SMALL LETTER J
U+006B      k      11970   LATIN SMALL LETTER K
U+006C      l      3992    LATIN SMALL LETTER L
U+006D      m      4357    LATIN SMALL LETTER M
U+006E      n      16349   LATIN SMALL LETTER N
U+006F      o      10298   LATIN SMALL LETTER O
U+0070      p      4505    LATIN SMALL LETTER P
U+0071      q      103     LATIN SMALL LETTER Q
U+0072      r      1762    LATIN SMALL LETTER R
U+0073      s      6593    LATIN SMALL LETTER S
U+0074      t      3753    LATIN SMALL LETTER T
U+0075      u      7969    LATIN SMALL LETTER U
U+0076      v      468     LATIN SMALL LETTER V
U+0077      w      8275    LATIN SMALL LETTER W
U+0078      x      85      LATIN SMALL LETTER X
U+0079      y      7427    LATIN SMALL LETTER Y
U+007A      z      1964    LATIN SMALL LETTER Z
U+00A8      ¨      1       DIAERESIS
U+00AB      «      204     LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00B0      °      1       DEGREE SIGN
U+00BB      »      198     RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00CB      Ë      46      LATIN CAPITAL LETTER E WITH DIAERESIS
U+00D6      Ö      73      LATIN CAPITAL LETTER O WITH DIAERESIS
U+00DC      Ü      71      LATIN CAPITAL LETTER U WITH DIAERESIS
U+00E7      ç      21      LATIN SMALL LETTER C WITH CEDILLA
U+00E8      è      221     LATIN SMALL LETTER E WITH GRAVE
U+00E9      é      107     LATIN SMALL LETTER E WITH ACUTE
U+00EA      ê      28      LATIN SMALL LETTER E WITH CIRCUMFLEX
U+00EB      ë      8400    LATIN SMALL LETTER E WITH DIAERESIS
U+00EE      î      3       LATIN SMALL LETTER I WITH CIRCUMFLEX
U+00F6      ö      12678   LATIN SMALL LETTER O WITH DIAERESIS
U+00FB      û      26      LATIN SMALL LETTER U WITH CIRCUMFLEX
U+00FC      ü      5863    LATIN SMALL LETTER U WITH DIAERESIS
U+0186      Ɔ      58      LATIN CAPITAL LETTER OPEN O
U+0190      Ɛ      70      LATIN CAPITAL LETTER OPEN E
U+0254      ɔ      8123    LATIN SMALL LETTER OPEN O
U+025B      ɛ      11942   LATIN SMALL LETTER OPEN E
U+0269      ɩ      990     LATIN SMALL LETTER IOTA
U+028B      ʋ      2763    LATIN SMALL LETTER V WITH HOOK
U+02BC      ʼ      20080   MODIFIER LETTER APOSTROPHE
U+02D7      ˗      3786    MODIFIER LETTER MINUS SIGN
U+02EE      ˮ      7836    MODIFIER LETTER DOUBLE APOSTROPHE
U+0304             1       COMBINING MACRON
U+0308             4589    COMBINING DIAERESIS
U+2026      …      7       HORIZONTAL ELLIPSIS
U+203A      ›      15      SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
U+A78A      ꞊      5447    MODIFIER LETTER SHORT EQUALS SIGN
U+FFF9             17      INTERLINEAR ANNOTATION ANCHOR
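
A table with the columns shown above could be produced along these lines. This is only a sketch in Python under stated assumptions, not the method used by the generate-corpus.bash script. The function name `char_table` and the `CONTROL_NAMES` dictionary are hypothetical; the dictionary exists because `unicodedata.name` has no name for control characters, so a few aliases are filled in by hand.

```python
import unicodedata
from collections import Counter

# Hand-supplied aliases for control characters (assumption: only these
# four controls need a readable name; others fall back to "<unnamed>").
CONTROL_NAMES = {
    "\t": "CHARACTER TABULATION",
    "\n": "LINE FEED",
    "\x0c": "FORM FEED",
    "\r": "CARRIAGE RETURN",
}


def char_table(text: str) -> list:
    """Return (code point, glyph, count, name) rows sorted by code point."""
    rows = []
    for ch, count in sorted(Counter(text).items()):
        name = CONTROL_NAMES.get(ch) or unicodedata.name(ch, "<unnamed>")
        # Show a glyph only for printable, non-space, non-combining characters.
        visible = (
            ch.isprintable()
            and not ch.isspace()
            and not unicodedata.combining(ch)
        )
        rows.append((f"U+{ord(ch):04X}", ch if visible else "", count, name))
    return rows


if __name__ == "__main__":
    for row in char_table("\u025b\u0308 \u0254\r"):
        print("\t".join(str(field) for field in row))
```

Leaving the glyph column empty for controls, spaces and combining marks keeps the rows aligned, which matches how those rows appear in the table above.
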
iandoug commented 6 years ago

Apologies ... was a bit confused. U+000A == line feed: this is correct Unix/Linux end of line marker.

Was actually confusing Unix/Linux with OLD Mac version. Not sure why my editor (Kate) said the file had Macintosh line endings... possibly the old source material was done on old Mac.

Windows - Lines end with both a [CR] followed by a [LF] character
Linux - Lines end with only a [LF] character
Macintosh (Mac OSX) - Lines end with only a [LF] character
Macintosh (old) - Lines end with only a [CR] character
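
To check which of the conventions above a file actually uses, the three kinds of line ending can be counted separately. The following is a rough Python sketch; `count_line_endings` is a hypothetical helper name, and it assumes the text was read without newline translation.

```python
import re


def count_line_endings(text: str) -> dict:
    """Count CR+LF pairs, lone LF and lone CR characters in text."""
    return {
        "crlf": len(re.findall(r"\r\n", text)),     # Windows
        "lf": len(re.findall(r"(?<!\r)\n", text)),  # Linux, Mac OS X
        "cr": len(re.findall(r"\r(?!\n)", text)),   # old Macintosh
    }


if __name__ == "__main__":
    sample = "one\r\ntwo\nthree\rfour\r"
    print(count_line_endings(sample))
```

When reading from disk, opening the file with `open(path, newline="")` keeps the original line endings intact; the default text mode would translate them all to LF before they could be counted.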

HughP commented 6 years ago

I changed the direction of the character conversion. Update your copy of the script and see if that clears things up for you.

On Wed, Jun 20, 2018 at 4:15 AM, Ian Douglas notifications@github.com wrote:

Apologies ... was a bit confused. U+000A == line feed: this is correct Unix/Linux end of line marker.

Was actually confusing Unix/Linux with OLD Mac version. Not sure why my editor (Kate) said the file had Macintosh line endings... possibly the old source material was done on old Mac.

Windows - Lines end with both a [CR] followed by a [LF] character
Linux - Lines end with only a [LF] character
Macintosh (Mac OSX) - Lines end with only a [LF] character
Macintosh (old) - Lines end with only a [CR] character
