Sicos1977 / IFilterTextReader

A reader that gets text from different file formats through the IFilter interface

Weird text encoding issue with colons and section symbols #37

Closed by richardtallent 5 years ago

richardtallent commented 5 years ago

I'm reading text from this document:

https://www.doa.la.gov/osr/lac/33v01/33v01.doc

This is a Word 97-2000 file created by a contractor for the State of Louisiana (I'm not affiliated with either). When I use FilterReader.ReadToEnd() to pull the text, the section symbols (§) are replaced with colons (:). There may be some other substitutions, but this one stuck out as quite obvious.

I thought it could be a text encoding issue, but I can't find a code page that uses ":" for 0x00A7, and there doesn't appear to be a way in Word 2013 to see which encoding the file is using.

This could be an unsolvable problem with the underlying IFilter driver, but I thought it was worth mentioning in case it's something this library can account for.

mantis commented 5 years ago

@richardtallent FilterReader.cs calls CleanUpCharacters() on the string returned by the filter.

That method replaces a number of "junk" characters with a colon, e.g.:

                case 0x00A7: // section-sign
                case 0x2020: // dagger
                case 0x2021: // double-dagger
                case 0x2022: // bullet
                case 0x2023: // triangle bullet
                case 0x203B: // reference mark
                case 0xFE55: // small colon
                    buffer[i] = ':';

Removing those cases leaves the section symbol intact in the output.

I wonder whether this "cleanup" should be done by default.
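
For illustration, here is a minimal sketch of how the replacement could be made opt-out for the section sign. It is not the library's code: `CleanupSketch`, `ReplaceReferenceMarks`, and the `keepSectionSign` parameter are hypothetical names, and the character list is copied only from the excerpt quoted above, which may not be the complete set handled by CleanUpCharacters().

```csharp
using System;

public static class CleanupSketch
{
    // Hypothetical helper (not part of IFilterTextReader): maps the
    // reference-style marks quoted in this thread to ':' and optionally
    // leaves the section sign untouched.
    public static string ReplaceReferenceMarks(string text, bool keepSectionSign)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var buffer = text.ToCharArray();
        for (var i = 0; i < buffer.Length; i++)
        {
            switch (buffer[i])
            {
                case '\u00A7': // section sign
                    if (!keepSectionSign) buffer[i] = ':';
                    break;

                case '\u2020': // dagger
                case '\u2021': // double dagger
                case '\u2022': // bullet
                case '\u2023': // triangle bullet
                case '\u203B': // reference mark
                case '\uFE55': // small colon
                    buffer[i] = ':';
                    break;
            }
        }
        return new string(buffer);
    }
}
```

With this sketch, `CleanupSketch.ReplaceReferenceMarks("§101 • item", keepSectionSign: true)` returns `"§101 : item"`, while passing `false` returns `":101 : item"`, which matches the behaviour reported at the top of this issue.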