Faulty character #x001E appears in csv dump - better sub with hyphen in Word to csv dump

epogrebnyak commented 9 years ago

Word misappropriates code point #x001E for its non-breaking hyphen instead of using the proper Unicode code point #x2011. Code point #x001E is actually an ASCII control code indicating a record separator.

Example: "die stamping machines and hammers" in ind06/tab4.doc.
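A minimal sketch of the substitution suggested in the title, assuming the text extracted from the Word file is already available as a Python string. The function name `fix_word_hyphen` and the sample string are hypothetical and not taken from convert.py.

```python
RECORD_SEPARATOR = "\x1e"        # what Word stores for a non-breaking hyphen
NON_BREAKING_HYPHEN = "\u2011"   # the proper Unicode code point


def fix_word_hyphen(text, replacement="-"):
    """Replace every #x001E in text with a hyphen.

    The default is a plain ASCII hyphen, which is safe in any csv dump;
    pass NON_BREAKING_HYPHEN to keep the non-breaking behaviour instead.
    """
    return text.replace(RECORD_SEPARATOR, replacement)


# Hypothetical sample, not the actual cell from ind06/tab4.doc
sample = "non" + RECORD_SEPARATOR + "breaking"
print(fix_word_hyphen(sample))
```

Applying this to each cell before it is written out keeps the control character out of the csv dump.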

epogrebnyak commented 9 years ago

Also:

Error in convert.py when the output goes to the screen (command line):

return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 536: character maps to <undefined>

The error does not appear when the output is redirected to a file (`python convert.py > 1.txt`), and it does not appear in the IDE (Spyder).
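One plausible explanation, not confirmed in this thread: the console uses a legacy code page that has no slot for '\u201c', while a redirected file or the IDE uses a different encoding that does. A rough workaround sketch is to replace unencodable characters before printing. The helper name `printable` is hypothetical, and cp866 is used only as an illustrative legacy code page, since the actual console encoding is not stated above.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the current stdout encoding, then to UTF-8
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# cp866 has no curly quotes, so both are replaced with '?'
print(printable("\u201cquoted\u201d", encoding="cp866"))
```

Setting the PYTHONIOENCODING environment variable (for example to `utf-8:replace`) is another way to get similar behaviour without touching the code.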

kiranbeethoju commented 3 years ago

I solved it. Follow this if you still face any issue.

Char #x001a ('\x1a') <\x1a> is illegal.
Char #x001b ('\x1b') <\x1b> is illegal.
Char #x001c ('\x1c') <\x1c> is illegal.
Char #x001d ('\x1d') <\x1d> is illegal.
Char #x001e ('\x1e') <\x1e> is illegal.
Char #x001f ('\x1f') <\x1f> is illegal.
Char #x0020 (' ') < > is legal.
Char #x0021 ('!') <!> is legal.
Char #x0129 ('ĩ') <ĩ> is legal.
Char #xd8ff ('\ud8ff') <\ud8ff> is illegal.
Char #xfff0 ('\ufff0') <\ufff0> is legal.
Char #xffff ('\uffff') <\uffff> is illegal.
Char #x10fff0 ('\U0010fff0') <\U0010fff0> is legal.
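The legal/illegal verdicts in the listing above look consistent with the XML 1.0 `Char` production; that the listing comes from an XML validity check is an assumption, since the comment does not say so. Under that assumption, a small sketch of the check (the function name `is_legal_xml_char` is hypothetical):

```python
def is_legal_xml_char(codepoint):
    """True if the code point falls in the XML 1.0 Char ranges."""
    return (
        codepoint in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= codepoint <= 0xD7FF          # excludes other C0 controls
        or 0xE000 <= codepoint <= 0xFFFD        # excludes surrogates, #xFFFE, #xFFFF
        or 0x10000 <= codepoint <= 0x10FFFF     # supplementary planes
    )


for cp in (0x1E, 0x20, 0x129, 0xD8FF, 0xFFF0, 0xFFFF, 0x10FFF0):
    verdict = "legal" if is_legal_xml_char(cp) else "illegal"
    print("Char #x{:04x} is {}.".format(cp, verdict))
```

This is why #x001E is rejected while the proper non-breaking hyphen #x2011 would pass.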

epogrebnyak commented 3 years ago

> I solved it. Follow this if you still face any issue.

@kiranbeethoju - No longer an issue, but thanks for the link!

epogrebnyak / data-rosstat-kep

Faulty character #x001E appears in csv dump - better sub with hyphen in Word to csv dump #8
