Some character sets are not supported for HTML document conversion

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Index the sample collection in src/test/sampledocs
2. Search for "clifford"
3. This will return two documents - one is a direct copy of
http://alanmacfarlane.com/DO/filmshow/geertz1tx.htm, which is in the
windows-1252 character set, and one is a version of this which has been
converted to UTF-8.  In the former, various characters are displayed as
a "?" in a diamond in my browser, indicating invalid data.  They should be
displayed as dashes, and in the UTF-8 version they are.  This is because
the html text extractor does not understand the windows-1252 character set.
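
The symptom can be reproduced in a few lines of Python. This is only a minimal sketch, not the extractor's actual code. The sample bytes are an assumed stand-in for the page content, using byte 0x96, which is an en dash in windows-1252 but is not valid on its own in UTF-8.

```python
# Byte 0x96 is an en dash in windows-1252; on its own it is invalid UTF-8.
# The sample text is illustrative, not taken from the real page.
raw = b"Clifford Geertz \x96 interview"

# Treating the bytes as UTF-8 yields U+FFFD, which browsers draw as a "?"
# in a diamond.
misread = raw.decode("utf-8", errors="replace")

# Decoding with the declared character set recovers the dash.
text = raw.decode("cp1252")

# Re-encode as UTF-8 for indexing and display.
utf8_bytes = text.encode("utf-8")
```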

There are two possible fixes - firstly, the html extractor can be compiled
with the "iconv" library on platforms where iconv is available, and will
then support the windows-1252 character set (and many others).  This is
done by changing the contents of config.h, and there should really be a
configure script to use this if the platform supports it.  Alternatively, a
windows-specific mechanism for converting character sets may be used (I'm not
sure if iconv is available on windows).

Comments on how best to convert between character sets on windows would be
welcomed.

Original issue reported on code.google.com by boulton.rj@gmail.com on 17 Dec 2007 at 4:08

GoogleCodeExporter commented 9 years ago

JOOI - does the html IFilter get it right?

I guess then in general you can't know what the character encoding is for a
document.

Perhaps it's worth considering a way of allowing the character encoding to be
configured somehow. This would need to be thought through, since different
filters might generate text encoded in different ways for the same document.

Original comment by paul.x.r...@googlemail.com on 21 Dec 2007 at 6:17

GoogleCodeExporter commented 9 years ago

I don't understand your comment that "in general you can't know what the
character encoding is for a document".  Possibly that's true, but in this case
we do know the character set - it's in a meta tag in the HTML - the problem is
simply that the html parser doesn't know how to convert that character set to
UTF-8.  If we use iconv, or some windows-specific character set conversion
mechanism, we should be able to cope with pretty much any character set out
there.  There will always be a possibility of coming across an unknown
character set - we should probably raise that as a warning.
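
A rough sketch of that behaviour in Python, assuming the charset is declared in a meta tag near the top of the file. The function name `html_to_utf8`, the 4096-byte search window and the fallback to UTF-8 are all assumptions made for illustration; the real htmltotext parser is C++ and may work quite differently.

```python
import codecs
import re
import warnings

# Matches charset=... inside a meta tag, covering both the http-equiv form
# and the <meta charset="..."> form. A simplification, not a full HTML parser.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def html_to_utf8(raw, default="utf-8"):
    """Convert raw HTML bytes to UTF-8 using the charset named in a meta tag."""
    # Only look near the top of the document (the window size is an assumption).
    match = META_CHARSET.search(raw[:4096])
    name = match.group(1).decode("ascii") if match else default
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown character set: warn and fall back instead of failing.
        warnings.warn("unknown character set %r, falling back to %s" % (name, default))
        name = default
    # Replace undecodable bytes rather than rejecting the whole document.
    return raw.decode(name, errors="replace").encode("utf-8")
```

Falling back with a warning keeps indexing going when an unrecognised charset turns up, at the cost of possibly mangling some characters in that document.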

Original comment by boulton.rj@gmail.com on 21 Dec 2007 at 9:42

GoogleCodeExporter commented 9 years ago

iconv is certainly available for Windows:
http://gnuwin32.sourceforge.net/packages/libiconv.htm

Original comment by charliej...@gmail.com on 9 Jan 2008 at 1:47

GoogleCodeExporter commented 9 years ago

Then the simplest fix is probably to use iconv.  However, iconv with a full
set of character set databases is likely to be rather large, so if we can use
the windows built-in character set conversion routines, that might well be
preferable.

Original comment by boulton.rj@gmail.com on 9 Jan 2008 at 4:34

GoogleCodeExporter commented 9 years ago

I'm not sure what you mean by the built-in routines in this context. Perhaps we
should try iconv - how do we do this? For reference, the current package is
8.4 MB.

Original comment by charliej...@gmail.com on 10 Jan 2008 at 1:26

GoogleCodeExporter commented 9 years ago

I thought you said (maybe it was at a face-to-face meeting) that the windows OS
has its own routines for converting character sets.  Such routines are what I
mean by "built-in" routines.

To try out iconv: first you edit /libs/htmltotext/src/config.h to say

#define USE_ICONV 1

Then you have to ensure that <iconv.h> is on the include path and that the iconv
library gets linked into the built htmltotext DLL.  This probably involves
hacking the libs/htmltotext/setup.py file in some way, but I'm not sure what the
details of that will be.
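
One possible shape for the setup.py change is sketched below. It is not the project's actual build script: `iconv_build_options`, `iconv_prefix` and the default library name are hypothetical, and the correct library name and install location depend on the platform and on which iconv package is used.

```python
def iconv_build_options(iconv_prefix=None, library="iconv"):
    """Return keyword arguments for a setuptools/distutils Extension that
    enable USE_ICONV and link against an iconv library.

    iconv_prefix and library are hypothetical parameters; their values
    depend on where and how iconv is installed.
    """
    options = {
        # Equivalent to "#define USE_ICONV 1", passed on the compiler command line.
        "define_macros": [("USE_ICONV", "1")],
        "libraries": [library],
        "include_dirs": [],
        "library_dirs": [],
    }
    if iconv_prefix is not None:
        # So that <iconv.h> is found and the library can be linked.
        options["include_dirs"].append(iconv_prefix + "/include")
        options["library_dirs"].append(iconv_prefix + "/lib")
    return options
```

These options would be passed to the existing extension definition, for example `Extension("htmltotext", sources=[...], **iconv_build_options(...))`, with the source list left as whatever setup.py already uses. Setting the macro this way is an alternative to editing config.h by hand.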

Original comment by boulton.rj@gmail.com on 10 Jan 2008 at 1:39

GoogleCodeExporter commented 9 years ago

iconv is now used and the issue appears to be fixed. It has only inflated the
package size by 400 KB or so.

Original comment by charliej...@gmail.com on 10 Jan 2008 at 3:37

Changed state: Fixed

flaxsearch / flaxcode

Some character sets are not supported for HTML document conversion #162