Closed iandoug closed 6 years ago
I'm also picking up these other "space" characters: (counts are from my current version of your mass-text file after assorted edits) : 3269 : 59 : 17 : 1 I think these are assorted "non-breaking-spaces" or similar, courtesy of the word processing software used, Unicode has a whack:
U+2000 EN QUAD
U+200A HAIR SPACE
U+00A0 NO-BREAK SPACE
U+200B ZERO WIDTH SPACE
U+2060 WORD JOINER
U+3000 IDEOGRAPHIC SPACE
U+FEFF ZERO WIDTH NO-BREAK SPACE
Just trying to persuade PHP to show me the U+... form, think I need to switch to PHP 7 to get that functionality. Or try and find them in the hex editor.
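For anyone hitting the same problem, here is a minimal sketch of listing such characters without a hex editor. It is written in Python rather than PHP, the function name `invisible_report` and the sample string are made up for illustration, and the choice of Unicode general categories (Zs, Cf, Cc) is an assumption about what counts as a "space-like" character here.

```python
import unicodedata
from collections import Counter

# General categories treated as "invisible" in this sketch:
# Zs = space separator, Cf = format character, Cc = control character.
INVISIBLE_CATEGORIES = {"Zs", "Cf", "Cc"}


def invisible_report(text):
    """Return (code point, count, name) rows for space-like characters.

    The ordinary space U+0020 is skipped because it is expected in any text.
    """
    counts = Counter(
        ch for ch in text
        if ch != " " and unicodedata.category(ch) in INVISIBLE_CATEGORIES
    )
    return [
        (f"U+{ord(ch):04X}", n, unicodedata.name(ch, "<no name>"))
        for ch, n in sorted(counts.items())
    ]


# Hypothetical sample line, not taken from the corpus.
sample = "a\u00a0b\u200bc\u2060d e\u00a0f"
for row in invisible_report(sample):
    print(*row)
```

Control characters such as tab or carriage return have no name in Python's `unicodedata`, so they show up with the `<no name>` placeholder.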
@iandoug
Yes, some of those characters need to move to symbols. I handled those in the script version of the corpus.
I don't think you have some of these code points in the corpus. For instance, do you really have a U+2060?
@iandoug did you ever figure something out for PHP? There seem to be some ideas here: https://stackoverflow.com/a/35213288
Fixed these issues in the latest release of the generate-corpus.bash script.
Yes, somewhere along the way I stumbled across the json_encode trick, but it's not a complete solution since that function leaves normal low-ASCII chars as-is. So wrote some code that handled both cases. Used it for checking the fonts for required characters. Don't want to move to PHP 7 because I have some old systems that will break in PHP 7. Needs a major rewrite... :-(
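The "both cases" idea, showing every character in U+ form whether it is low ASCII or not, can be sketched as follows. This is not the PHP code mentioned above; it is a small Python illustration, and `to_u_notation` is a hypothetical name.

```python
def to_u_notation(text):
    """Return every character of text in U+XXXX form, ASCII included."""
    # ord() gives the code point; :04X pads to at least four hex digits,
    # and code points above U+FFFF simply use more digits.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# 'a' is low ASCII, U+00EB is a precomposed letter, U+2060 is invisible.
print(to_u_notation("a\u00eb\u2060"))
```

Because the function treats all characters alike, there is no separate branch for ASCII versus non-ASCII input.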
I know it's closed, but that list of "space characters" with Unicode code points was not from the file; it was just a set of suggestions as to what the unknown characters could be. I currently have these undesirable characters in the file, after using my "strip French" program (which BTW I think misses a few French words ... will revert on that):
U+000A : 13342 (line feed .. should remove)
U+0308 : 3269 (combining diaeresis)...
U+000D : 897 (carriage return, Okay).
U+000C : 110 (form feed, should remove)
U+FFF9 : 17 (interlinear annotation anchor ... should probably remove)
U+201A ‚ : 7 (single low quotation mark... think you removed those)
U+2026 … : 7 (ellipsis)
U+0009 : 1 (tab)
U+00A8 ¨ : 1 (diaeresis)
U+0304 : 1 (combining macron)
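One possible way to reduce the combining diaeresis count is Unicode NFC normalization, which merges a base letter and a combining mark into a single precomposed character where one exists. Whether that is wanted for this corpus is an assumption; the sketch below (Python, hypothetical function names) only shows the mechanics and how to list the marks that remain afterwards.

```python
import unicodedata


def compose(text):
    """Apply NFC so that e.g. 'e' + U+0308 becomes the single letter U+00EB."""
    return unicodedata.normalize("NFC", text)


def leftover_combining(text):
    """List the combining marks still present, in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.combining(ch)]


# 'e' + diaeresis has a precomposed form; open e (U+025B) + diaeresis does not.
sample = "e\u0308 \u025b\u0308"
composed = compose(sample)
print(leftover_combining(composed))
```

Letters such as ɛ have no precomposed form with a diaeresis, so a nonzero U+0308 count after normalization is not necessarily an error.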
The big things left are converting hyphen-minus to letter minus and making sure there are no double carriage returns. Then there are a couple of one-offs that I need to hunt down and fix.
Here are the characters in the corpus as I still have them:
code point glyph count character name
U+0009 240 CHARACTER TABULATION
U+000D 16763 CARRIAGE RETURN
U+0020 78603 SPACE
U+0021 ! 68 EXCLAMATION MARK
U+0028 ( 482 LEFT PARENTHESIS
U+0029 ) 483 RIGHT PARENTHESIS
U+002A * 20 ASTERISK
U+002C , 4754 COMMA
U+002D - 27453 HYPHEN-MINUS
U+002E . 4170 FULL STOP
U+002F / 17 SOLIDUS
U+0030 0 867 DIGIT ZERO
U+0031 1 301 DIGIT ONE
U+0032 2 436 DIGIT TWO
U+0033 3 136 DIGIT THREE
U+0034 4 110 DIGIT FOUR
U+0035 5 181 DIGIT FIVE
U+0036 6 81 DIGIT SIX
U+0037 7 160 DIGIT SEVEN
U+0038 8 268 DIGIT EIGHT
U+0039 9 116 DIGIT NINE
U+003A : 488 COLON
U+003B ; 79 SEMICOLON
U+003F ? 201 QUESTION MARK
U+0041 A 1044 LATIN CAPITAL LETTER A
U+0042 B 421 LATIN CAPITAL LETTER B
U+0043 C 15 LATIN CAPITAL LETTER C
U+0044 D 767 LATIN CAPITAL LETTER D
U+0045 E 108 LATIN CAPITAL LETTER E
U+0046 F 97 LATIN CAPITAL LETTER F
U+0047 G 448 LATIN CAPITAL LETTER G
U+0048 H 26 LATIN CAPITAL LETTER H
U+0049 I 66 LATIN CAPITAL LETTER I
U+004A J 9 LATIN CAPITAL LETTER J
U+004B K 1223 LATIN CAPITAL LETTER K
U+004C L 145 LATIN CAPITAL LETTER L
U+004D M 668 LATIN CAPITAL LETTER M
U+004E N 353 LATIN CAPITAL LETTER N
U+004F O 50 LATIN CAPITAL LETTER O
U+0050 P 301 LATIN CAPITAL LETTER P
U+0052 R 8 LATIN CAPITAL LETTER R
U+0053 S 479 LATIN CAPITAL LETTER S
U+0054 T 274 LATIN CAPITAL LETTER T
U+0055 U 50 LATIN CAPITAL LETTER U
U+0056 V 121 LATIN CAPITAL LETTER V
U+0057 W 508 LATIN CAPITAL LETTER W
U+0059 Y 976 LATIN CAPITAL LETTER Y
U+005A Z 385 LATIN CAPITAL LETTER Z
U+005B [ 10 LEFT SQUARE BRACKET
U+005C \ 1 REVERSE SOLIDUS
U+005D ] 10 RIGHT SQUARE BRACKET
U+0061 a 29819 LATIN SMALL LETTER A
U+0062 b 9792 LATIN SMALL LETTER B
U+0063 c 436 LATIN SMALL LETTER C
U+0064 d 12033 LATIN SMALL LETTER D
U+0065 e 5895 LATIN SMALL LETTER E
U+0066 f 429 LATIN SMALL LETTER F
U+0067 g 10265 LATIN SMALL LETTER G
U+0068 h 15281 LATIN SMALL LETTER H
U+0069 i 8555 LATIN SMALL LETTER I
U+006A j 71 LATIN SMALL LETTER J
U+006B k 11970 LATIN SMALL LETTER K
U+006C l 3992 LATIN SMALL LETTER L
U+006D m 4357 LATIN SMALL LETTER M
U+006E n 16349 LATIN SMALL LETTER N
U+006F o 10298 LATIN SMALL LETTER O
U+0070 p 4505 LATIN SMALL LETTER P
U+0071 q 103 LATIN SMALL LETTER Q
U+0072 r 1762 LATIN SMALL LETTER R
U+0073 s 6593 LATIN SMALL LETTER S
U+0074 t 3753 LATIN SMALL LETTER T
U+0075 u 7969 LATIN SMALL LETTER U
U+0076 v 468 LATIN SMALL LETTER V
U+0077 w 8275 LATIN SMALL LETTER W
U+0078 x 85 LATIN SMALL LETTER X
U+0079 y 7427 LATIN SMALL LETTER Y
U+007A z 1964 LATIN SMALL LETTER Z
U+00A8 ¨ 1 DIAERESIS
U+00AB « 204 LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00B0 ° 1 DEGREE SIGN
U+00BB » 198 RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00CB Ë 46 LATIN CAPITAL LETTER E WITH DIAERESIS
U+00D6 Ö 73 LATIN CAPITAL LETTER O WITH DIAERESIS
U+00DC Ü 71 LATIN CAPITAL LETTER U WITH DIAERESIS
U+00E7 ç 21 LATIN SMALL LETTER C WITH CEDILLA
U+00E8 è 221 LATIN SMALL LETTER E WITH GRAVE
U+00E9 é 107 LATIN SMALL LETTER E WITH ACUTE
U+00EA ê 28 LATIN SMALL LETTER E WITH CIRCUMFLEX
U+00EB ë 8400 LATIN SMALL LETTER E WITH DIAERESIS
U+00EE î 3 LATIN SMALL LETTER I WITH CIRCUMFLEX
U+00F6 ö 12678 LATIN SMALL LETTER O WITH DIAERESIS
U+00FB û 26 LATIN SMALL LETTER U WITH CIRCUMFLEX
U+00FC ü 5863 LATIN SMALL LETTER U WITH DIAERESIS
U+0186 Ɔ 58 LATIN CAPITAL LETTER OPEN O
U+0190 Ɛ 70 LATIN CAPITAL LETTER OPEN E
U+0254 ɔ 8123 LATIN SMALL LETTER OPEN O
U+025B ɛ 11942 LATIN SMALL LETTER OPEN E
U+0269 ɩ 990 LATIN SMALL LETTER IOTA
U+028B ʋ 2763 LATIN SMALL LETTER V WITH HOOK
U+02BC ʼ 20080 MODIFIER LETTER APOSTROPHE
U+02D7 ˗ 3786 MODIFIER LETTER MINUS SIGN
U+02EE ˮ 7836 MODIFIER LETTER DOUBLE APOSTROPHE
U+0304 1 COMBINING MACRON
U+0308 4589 COMBINING DIAERESIS
U+2026 … 7 HORIZONTAL ELLIPSIS
U+203A › 15 SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
U+A78A ꞊ 5447 MODIFIER LETTER SHORT EQUALS SIGN
U+FFF9 17 INTERLINEAR ANNOTATION ANCHOR
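The two clean-up steps listed as still open with this table could look roughly like the sketch below. It assumes that "letter minus" means U+02D7 MODIFIER LETTER MINUS SIGN (which appears in the table) and that every hyphen-minus should be converted, which may be too blunt if some are real punctuation. It is a Python illustration with a made-up function name, not an excerpt from generate-corpus.bash.

```python
import re

HYPHEN_MINUS = "\u002d"    # '-' as typed on a standard keyboard
MODIFIER_MINUS = "\u02d7"  # assumed target: MODIFIER LETTER MINUS SIGN


def clean_remaining(text):
    """Convert hyphen-minus to modifier minus and collapse repeated CRs."""
    text = text.replace(HYPHEN_MINUS, MODIFIER_MINUS)
    # Two or more consecutive carriage returns become a single one.
    return re.sub(r"\r{2,}", "\r", text)


# Hypothetical sample, not taken from the corpus.
print(repr(clean_remaining("a-b\r\r\rc")))
```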
Apologies ... was a bit confused. U+000A == line feed: this is correct Unix/Linux end of line marker.
Was actually confusing Unix/Linux with OLD Mac version. Not sure why my editor (Kate) said the file had Macintosh line endings... possibly the old source material was done on old Mac.
Windows - Lines end with both a [CR] followed by a [LF] character
Linux - Lines end with only a [LF] character
Macintosh (Mac OSX) - Lines end with only a [LF] character
Macintosh (old) - Lines end with only a [CR] character
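To check which of these conventions a file actually uses, rather than relying on what the editor reports, the endings can be counted directly on the raw bytes. A minimal Python sketch, with a hypothetical function name:

```python
def count_line_endings(data: bytes):
    """Count CRLF pairs, lone LF and lone CR in raw file contents."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one LF and one CR, so subtract the pairs.
    lone_lf = data.count(b"\n") - crlf
    lone_cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


# Hypothetical sample mixing all three conventions.
print(count_line_endings(b"a\r\nb\nc\rd\r"))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before they are counted.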
I changed the direction of the character conversion. Update your copy of the script and see if that clears things up for you.
Re Point 12 in Readme as of Sat 9 June GMT
Suggestions:
U+0009 == Tab: ignore or replace with 4 spaces. We can accept that the keyboard will have a tab key in a fixed location.
U+000A 30690 == line feed: delete
U+000C 220 == form feed: delete
U+000D 1340 == enter/return: leave. KLA uses this as enter/return.
U+001E 5442 == record separator: delete
U+0020 124711 == space: leave
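These suggestions map naturally onto a translation table. The sketch below is a Python illustration with hypothetical names; it picks the "replace with 4 spaces" option for tab and deletes line feeds as suggested, even though elsewhere in the thread U+000A is recognised as the normal Unix line ending, so that rule only makes sense for a file that uses carriage returns as its line breaks.

```python
# Code point -> replacement; None means "delete the character".
SUGGESTED = {
    0x0009: "    ",  # tab: replaced with 4 spaces (ignoring it is the other option)
    0x000A: None,    # line feed: delete
    0x000C: None,    # form feed: delete
    0x001E: None,    # record separator: delete
    # U+000D (enter/return) and U+0020 (space) are left untouched.
}


def apply_suggestions(text):
    """Apply the suggested deletions and the tab replacement in one pass."""
    return text.translate(SUGGESTED)


# Hypothetical sample, not taken from the corpus.
print(repr(apply_suggestions("a\tb\nc\x0cd\x1ee\rf g")))
```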