Cutezjz / galagosearch

Automatically exported from code.google.com/p/galagosearch
BSD 3-Clause "New" or "Revised" License

Non-ASCII character issues in web interface #1

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Non-ASCII characters show up incorrectly in the web interface, in both the snippet view and the document view. A binary dump of the web server output indicates that the bytes have been changed. Dumping documents at the command line from corpus file produces correct output.

Also, request.getParameter() does not parse special characters correctly. If there are non-ASCII characters in the parameter, they may get dropped.
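One possible cause, which this thread does not confirm, is that percent-encoded query bytes are decoded with a charset other than the one the browser used. The sketch below uses only the JDK's URLDecoder (it is not Jetty's or galago's code, and QueryDecodeDemo is a hypothetical name) to show how the same raw parameter decodes differently under UTF-8 and ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodeDemo {
    // Decode a percent-encoded parameter value with the named charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "café" as a browser would send it when the page is UTF-8.
        String raw = "caf%C3%A9";
        // Correct charset: 4 characters.
        System.out.println(decode(raw, "UTF-8").length());
        // Wrong charset: the two bytes of "é" become two separate characters.
        System.out.println(decode(raw, "ISO-8859-1").length());
    }
}
```

In a servlet, the usual counterpart is to tell the container which encoding to use before the first getParameter() call; whether that is sufficient for this bug is not established here.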

Need to follow this up with the Jetty list.

Original issue reported on code.google.com by trevor.s...@gmail.com on 5 Jan 2009 at 12:48

GoogleCodeExporter commented 8 years ago
Actually, I think the claim that "Dumping documents at the command line from corpus file produces correct output." is not true, so this may not be a Jetty issue after all.

I compared

en/articles/m/a/r/Martha_Root_9570.html

with what galago doc returns for the same document from wiki-small, and found a difference at character position 0005661 (octal).

The key bytes from the original are

octal 342 200 223 (0xE2 0x80 0x93, the UTF-8 encoding of the en dash –)

and from the galago-processed one the bytes are

211 077 077 (which is not a valid UTF-8 sequence; 077 is ASCII '?')
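To check the expected bytes independently of any dump tool, a small sketch like the following prints the UTF-8 encoding of the en dash (U+2013) in octal. OctalDump and toOctal are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class OctalDump {
    // Render each byte as a three-digit octal number, space separated.
    static String toOctal(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes >= 0x80 are not printed as negative ints.
            sb.append(String.format("%03o", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] enDash = "\u2013".getBytes(StandardCharsets.UTF_8);
        System.out.println(toOctal(enDash)); // prints "342 200 223"
    }
}
```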

Original comment by christop...@gmail.com on 24 Feb 2010 at 5:12

GoogleCodeExporter commented 8 years ago
Further information: my previous comment was perhaps going in the wrong direction. It now looks as if there are two issues.

1) The issue that I noticed arises because the distributed wiki-small.corpus appears to contain some Unicode errors. This can be fixed by rebuilding the corpus from the contents of the tar files.

2) The Jetty search interface still doesn't work right even with the corrected corpus file, so the original bug diagnosis is not wrong.

The wiki-small.corpus file from the download is different from the one created on my Mac by unpacking wiki-small.tar and rebuilding with galago make-index:

galagosearch cbrew$ ls -l my-wiki.corpus ~/tmp/wiki-small.corpus
-rw-------@ 1 cbrew  cbrew  37133537 Feb 11 01:50 /Users/cbrew/tmp/wiki-small.corpus
-rw-r--r--  1 cbrew  cbrew  37062226 Feb 25 07:37 my-wiki.corpus

If I now run

bash galagosearch-core/target/appassembler/bin/galago doc my-wiki.corpus Martha_Root_9570 > martha.html

the en-dash is OK but, indeed, the Jetty interface still doesn't do the right thing.

Original comment by christop...@gmail.com on 25 Feb 2010 at 7:13

GoogleCodeExporter commented 8 years ago
This is actually a more fundamental problem. It can be fixed by going into the UniversalParser and changing lines 36-41 into

if (split.isCompressed) {
    reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(stream), "UTF-8"));
} else {
    reader = new BufferedReader(new InputStreamReader(stream, "UTF-8"));
}

whereupon the search interface behaves correctly. The problem is that, at this stage, it is not in general clear whether a file really is UTF-8, so the "fix" introduces errors if the file isn't UTF-8 encoded.

Original comment by christop...@gmail.com on 10 Mar 2010 at 3:33

GoogleCodeExporter commented 8 years ago
Here's a diff against the current SVN that appears to provisionally fix the UTF-8 issues. The problem was that the internal representation of the strings returned by UniversalParser was always wrong (km² was being read as 4 chars, not 3, because the reader was created without an explicit charset and Java decoded the bytes with a single-byte encoding, apparently ISO-8859-1), but the output of the handle_doc branch of App.java was also wrong, so the batch search ended up doing the right thing by mistake.
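The "right thing by mistake" effect can be reproduced in isolation. The sketch below assumes the output side encoded with the same single-byte charset that the input side decoded with; the thread does not state exactly what handle_doc did, so treat this as one plausible mechanism. DoubleMistakeDemo and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DoubleMistakeDemo {
    // "km²" as it is stored in a UTF-8 file: 4 bytes for 3 characters.
    static byte[] km2Utf8() {
        return "km\u00B2".getBytes(StandardCharsets.UTF_8);
    }

    // Number of chars seen when the bytes are decoded correctly.
    static int lengthAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).length();
    }

    // Number of chars seen when every byte is treated as one character.
    static int lengthAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).length();
    }

    // Wrong decode followed by the matching wrong encode restores the bytes.
    static boolean roundTripPreservesBytes(byte[] bytes) {
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        return Arrays.equals(bytes, wrong.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] bytes = km2Utf8();
        System.out.println(lengthAsUtf8(bytes));            // prints 3
        System.out.println(lengthAsLatin1(bytes));          // prints 4
        System.out.println(roundTripPreservesBytes(bytes)); // prints true
    }
}
```

The round trip is lossless because ISO-8859-1 maps each of the 256 byte values to exactly one character, so the in-memory string is wrong while the bytes written back out are unchanged.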

The question of what happens if the documents being indexed are not UTF-8 compatible still needs to be addressed. Fortunately, this is irrelevant to ASCII documents, because UTF-8 is an ASCII superset. Unfortunately, it is not irrelevant to legacy encodings in the ISO 8859 series, or to similar CJK encodings.
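One way the open question might be approached, sketched here as an assumption rather than as what the attached diff does, is to decode strictly as UTF-8 and fall back to a legacy charset only when the bytes are malformed. FallbackDecode is a hypothetical name, and ISO-8859-1 as the fallback is an arbitrary choice; CJK encodings would need real charset detection, and some legacy byte sequences happen to be valid UTF-8 and would be misread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecode {
    // Try strict UTF-8 first; on malformed input, fall back to ISO-8859-1.
    static String decode(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback charset; every byte sequence is valid ISO-8859-1.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "km\u00B2".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {0x6B, 0x6D, (byte) 0xB2};
        // Both inputs decode to the same three-character string.
        System.out.println(decode(utf8).equals(decode(latin1))); // prints "true"
    }
}
```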

Original comment by christop...@gmail.com on 10 Mar 2010 at 12:57

Attachments: