Actually, I think the claim that "Dumping documents at the command line from the corpus file produces correct output" is not true, so this may not be a Jetty issue after all.
I compared the original file
en/articles/m/a/r/Martha_Root_9570.html
with what you get back from galago doc on wiki-small, and found a difference at character position 0005661 (octal). The key bytes from the original are octal 342 200 223 (the UTF-8 encoding of the en dash "–", i.e. 0xE2 0x80 0x93), while in the galago-processed version the bytes are octal 211 077 077 (which are not the UTF-8 encoding of anything good).
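For anyone reproducing the comparison, here is a minimal standalone sketch (my own illustration, not Galago code) that decodes both byte sequences and shows the corruption:

import java.nio.charset.Charset;

public class DashCheck {
  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");

    // Octal 342 200 223 = 0xE2 0x80 0x93, the UTF-8 encoding of U+2013 (en dash).
    byte[] good = { (byte) 0xE2, (byte) 0x80, (byte) 0x93 };
    System.out.println(new String(good, utf8)); // prints the en dash

    // Octal 211 077 077 = 0x89 0x3F 0x3F. 0x89 is not a legal UTF-8 lead byte,
    // and 0x3F is '?', the usual substitute for a character that failed to encode.
    byte[] bad = { (byte) 0x89, 0x3F, 0x3F };
    System.out.println(new String(bad, utf8)); // replacement char followed by "??"
  }
}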
Original comment by christop...@gmail.com on 24 Feb 2010 at 5:12
Further information. My previous comment was perhaps going in the wrong direction. It now looks as if there are two issues.

1) The issue I noticed arises because the distributed wiki-small.corpus appears to contain some Unicode mistakes. This can be fixed by rebuilding it from the contents of the tar files.

2) The Jetty search interface still doesn't work right even with the corrected corpus file, so the original bug diagnosis is not wrong.
The wiki-small.corpus file from the download is different from the one created on my Mac by unpacking wiki-small.tar and rebuilding with galago make-index:

galagosearch cbrew$ ls -l my-wiki.corpus ~/tmp/wiki-small.corpus
-rw-------@ 1 cbrew cbrew 37133537 Feb 11 01:50 /Users/cbrew/tmp/wiki-small.corpus
-rw-r--r--  1 cbrew cbrew 37062226 Feb 25 07:37 my-wiki.corpus
If I now do

bash galagosearch-core/target/appassembler/bin/galago doc my-wiki.corpus Martha_Root_9570 > martha.html

the en dash is OK, but, indeed, the Jetty interface doesn't do the right thing yet.
Original comment by christop...@gmail.com on 25 Feb 2010 at 7:13
This is actually a more fundamental problem. It can be fixed by going into the UniversalParser and changing lines 36-41 to:

if (split.isCompressed) {
  reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(stream), "UTF-8"));
} else {
  reader = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
}
whereupon the search interface behaves correctly. The problem is that it is not in general clear at this stage whether a file really is UTF-8, so the "fix" introduces errors if the file isn't UTF-8 encoded.
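For what it's worth, one way to keep this fix without hard-wiring UTF-8 would be to thread the encoding through as a parameter. This is only a sketch; the openReader helper and its charsetName argument are hypothetical, not existing Galago code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

public class ReaderFactory {
  // Same shape as the patched UniversalParser lines, but the encoding is
  // supplied by the caller (e.g. from configuration) instead of hard-coded.
  public static BufferedReader openReader(InputStream stream,
                                          boolean isCompressed,
                                          String charsetName) throws IOException {
    Charset cs = Charset.forName(charsetName); // "UTF-8", "ISO-8859-1", ...
    InputStream in = isCompressed ? new GZIPInputStream(stream) : stream;
    return new BufferedReader(new InputStreamReader(in, cs));
  }
}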
Original comment by christop...@gmail.com on 10 Mar 2010 at 3:33
Here's a diff to the current SVN that appears to provisionally fix the UTF-8 issues. The problem was that the internal representation of the strings returned by UniversalParser was always wrong (km² was being read as 4 chars, not 3, because Java assumes iso8859-1), but the output of the handle_doc branch of App.java was also wrong, so the batch search finished up doing the right thing by mistake.
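To see the "right by mistake" effect in isolation, here is a small sketch (assuming the on-disk bytes are UTF-8 but get decoded as iso8859-1, as described above):

import java.nio.charset.Charset;
import java.util.Arrays;

public class Mojibake {
  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");
    Charset latin1 = Charset.forName("ISO-8859-1");

    byte[] onDisk = "km\u00B2".getBytes(utf8);  // "km²": 4 bytes, k m 0xC2 0xB2
    String wrong = new String(onDisk, latin1);  // decodes one char per byte
    System.out.println(wrong.length());         // prints 4, not 3

    // Re-encoding with the same wrong charset restores the original bytes,
    // which is why the batch output looked correct even though the
    // in-memory string was garbage.
    byte[] roundTrip = wrong.getBytes(latin1);
    System.out.println(Arrays.equals(onDisk, roundTrip)); // true
  }
}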
The issue of what happens when the documents being indexed are not UTF-8 compatible still needs fixing. Fortunately, this is irrelevant for ASCII documents, because UTF-8 is a superset of ASCII. Unfortunately, it is not irrelevant for legacy encodings in the iso8859 series, or for similar CJK encodings.
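There is no fully general answer without out-of-band metadata, but a cheap sanity check is possible because a strict UTF-8 decode either succeeds or fails loudly. The sketch below is an illustration only; the decodeLenient helper and its fallback argument are my invention, not part of the patch:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Sniff {
  // Try a strict UTF-8 decode; REPORT makes malformed input throw instead
  // of being silently replaced. Pure ASCII always passes, since UTF-8 is an
  // ASCII superset; iso8859-x bytes >= 0x80 usually fail.
  public static String decodeLenient(byte[] data, Charset fallback) {
    CharsetDecoder strict = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      return strict.decode(ByteBuffer.wrap(data)).toString();
    } catch (CharacterCodingException e) {
      return new String(data, fallback); // assume a configured legacy encoding
    }
  }
}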
Original comment by christop...@gmail.com on 10 Mar 2010 at 12:57
Original issue reported on code.google.com by trevor.s...@gmail.com on 5 Jan 2009 at 12:48