Closed richardtallent closed 5 years ago
@richardtallent FilterReader.cs calls CleanUpCharacters() on the string returned by the filter.
That replaces a number of "junk" characters with a colon, e.g.:
case 0x00A7: // section-sign
case 0x2020: // dagger
case 0x2021: // double-dagger
case 0x2022: // bullet
case 0x2023: // triangle bullet
case 0x203B: // reference mark
case 0xFE55: // small colon
buffer[i] = ':';
Removing the cases above means the section symbol comes through unchanged.
I wonder whether this "cleanup" should be done by default or not.
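For what it's worth, here is a rough sketch of what an opt-out could look like. The class name CleanupSketch, the method Clean and the replaceSymbols flag are all made up for illustration and are not part of the library; the case list is only the excerpt quoted above, not the full set the real method handles.

```csharp
using System;

// Hypothetical helper, NOT the library's actual code. It mirrors the
// case list quoted above and adds an assumed opt-out flag.
public static class CleanupSketch
{
    public static string Clean(string text, bool replaceSymbols = true)
    {
        // With the flag off, the text is returned exactly as the filter produced it.
        if (!replaceSymbols || string.IsNullOrEmpty(text))
            return text;

        char[] buffer = text.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            // Switch on the code point as an int so the hex labels compile.
            switch ((int)buffer[i])
            {
                case 0x00A7: // section-sign
                case 0x2020: // dagger
                case 0x2021: // double-dagger
                case 0x2022: // bullet
                case 0x2023: // triangle bullet
                case 0x203B: // reference mark
                case 0xFE55: // small colon
                    buffer[i] = ':';
                    break;
            }
        }
        return new string(buffer);
    }
}
```

Calling CleanupSketch.Clean(text, replaceSymbols: false) would leave section symbols intact, while the default keeps today's behaviour for existing callers.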
I'm reading from this site:
https://www.doa.la.gov/osr/lac/33v01/33v01.doc
This is a Word 97-2000 file created by a contractor for the State of Louisiana (I'm not affiliated with either). When I use FilterReader.ReadToEnd() to pull the text, the section symbols (§) are replaced with colons (:). There may be some other substitutions, but this one stuck out as quite obvious.
I thought it could be a text encoding issue, but I can't find a code page that uses ":" for 0x00A7, and there doesn't appear to be a way in Word 2013 to see which encoding the file is using.
This could be an unsolvable problem with the underlying IFilter driver, but I thought it was worth mentioning in case it's something this library can account for.
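One way to narrow this down is to dump the code points of the extracted text and check whether the colon is a plain U+003A or something else. A minimal sketch, assuming the text has already been read into a string; the class CodePointDump and the method Describe are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;

// Hypothetical diagnostic helper, not part of the library.
public static class CodePointDump
{
    // Formats each UTF-16 code unit as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Running Describe over a short slice of the output around a known section heading shows which characters actually arrive. If U+003A appears where the document has §, the substitution happened before the string reached the caller rather than in the display font or console encoding.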