Closed GoogleCodeExporter closed 9 years ago
JOOI - does the html IFilter get it right?
I guess then in general you can't know what the character encoding is for a
document.
Perhaps it's worth considering a way of allowing the character encoding to be
configured somehow. This would need to be thought through, since different
filters might generate text encoded in different ways for the same document.
Original comment by paul.x.r...@googlemail.com
on 21 Dec 2007 at 6:17
I don't understand your comment that "in general you can't know what the
character encoding is for a document". Possibly that's true, but in this case
we do know the character set - it's in a meta tag in the HTML - the problem is
simply that the HTML parser doesn't know how to convert that character set to
UTF-8. If we use iconv, or some Windows-specific character set conversion
mechanism, we should be able to cope with pretty much any character set out
there. There will always be a possibility of coming across an unknown character
set - we should probably raise that as a warning.
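
A minimal sketch of the flow described above: read the character set from the
meta tag, convert to UTF-8, and warn on an unknown character set. This is an
illustration only, not the htmltotext code. Python's built-in codec registry
stands in for iconv, and the helper names, the ISO-8859-1 fallback and the
"replace" error handling are all assumptions.

```python
import codecs
import re
import warnings

# Hypothetical helpers. Python's codec registry stands in for iconv here.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the charset named in a <meta> tag, or None if none is declared."""
    match = _META_CHARSET.search(html_bytes)
    return match.group(1).decode("ascii") if match else None


def to_utf8(html_bytes, fallback="iso-8859-1"):
    """Re-encode html_bytes as UTF-8, warning if the declared charset is unknown.

    The fallback charset is an assumption made for this sketch.
    """
    charset = declared_charset(html_bytes) or fallback
    try:
        codecs.lookup(charset)  # raises LookupError for an unknown charset
    except LookupError:
        warnings.warn("unknown character set %r; assuming %s" % (charset, fallback))
        charset = fallback
    # Undecodable bytes are replaced rather than raising (also an assumption).
    return html_bytes.decode(charset, errors="replace").encode("utf-8")
```

With iconv the same shape would apply: a failure to open a converter for the
declared character set is the point at which the warning would be raised.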
Original comment by boulton.rj@gmail.com
on 21 Dec 2007 at 9:42
iconv is certainly available for Windows:
http://gnuwin32.sourceforge.net/packages/libiconv.htm
Original comment by charliej...@gmail.com
on 9 Jan 2008 at 1:47
Then the simplest fix is probably just to use iconv. However, iconv with a full
set of character set databases is likely to be rather large, so if we can use
the Windows built-in character set conversion routines, that might well be
preferable.
Original comment by boulton.rj@gmail.com
on 9 Jan 2008 at 4:34
I'm not sure what you mean by the built-in routines in this context. Perhaps we
should try iconv - how do we do this? For reference, the current package is
8.4 MB.
Original comment by charliej...@gmail.com
on 10 Jan 2008 at 1:26
I thought you said (maybe it was at a face-to-face meeting) that the Windows OS
has its own routines for converting character sets. Such routines are what I
mean by "built-in" routines.
To try out iconv: first you edit /libs/htmltotext/src/config.h to say
#define USE_ICONV 1
Then you have to ensure that <iconv.h> is on the include path and that the iconv
library gets linked in to the built htmltotext DLL. This probably involves
hacking the libs/htmltotext/setup.py file in some way, but I'm not sure what the
details of that will be.
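
A rough sketch of what that setup.py change might look like, assuming the
extension is built with a distutils/setuptools Extension, which accepts
include_dirs, library_dirs and libraries keyword arguments. The function name,
the install layout (include/ and lib/ under one root) and the library name are
assumptions; the actual details of libs/htmltotext/setup.py are not given here.

```python
import os


def iconv_build_options(iconv_root, library_name="iconv"):
    """Return Extension keyword arguments pointing the build at an iconv install.

    iconv_root is a hypothetical directory assumed to contain include/iconv.h
    and a lib/ directory holding the iconv library. The library name may
    differ between toolchains, so it is a parameter.
    """
    return {
        # Puts <iconv.h> on the compiler's include path.
        "include_dirs": [os.path.join(iconv_root, "include")],
        # Tells the linker where to look for the library.
        "library_dirs": [os.path.join(iconv_root, "lib")],
        # Links the iconv library into the built extension.
        "libraries": [library_name],
    }


# In setup.py the result would be passed through to the extension definition,
# for example: Extension(name, sources, **iconv_build_options(iconv_root))
```

USE_ICONV itself is still set in config.h as described above, so no macro
definition is needed in the build options.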
Original comment by boulton.rj@gmail.com
on 10 Jan 2008 at 1:39
iconv is now used and the issue appears to be fixed. It's only inflated the
package size by 400 KB or so.
Original comment by charliej...@gmail.com
on 10 Jan 2008 at 3:37
Original issue reported on code.google.com by
boulton.rj@gmail.com
on 17 Dec 2007 at 4:08