debug2012 / solr-php-client

Automatically exported from code.google.com/p/solr-php-client

Send query with non-latin characters to Solr #78

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. In a search field, enter one or more keywords in Greek (e.g. δοκιμή, "test")
2. Press Enter to search (even if you have documents indexed in Greek that 
contain the word, nothing comes back)
3. Check the URL created and sent to Solr for searching

What is the expected output? What do you see instead?

The expected output is a list of the successfully submitted documents that 
contain the Greek words/characters

What version of the product are you using? On what operating system?

Latest version, on a Linux box (CentOS 6.2)

Please provide any additional information below.

Solved it by bypassing the built-in PHP function http_build_query. NOTE: 
Solr is running with Tomcat as the servlet container, and URIEncoding is set to 
UTF-8 in the Tomcat server configuration file.

Original issue reported on code.google.com by andreas....@gmail.com on 27 May 2012 at 3:25

GoogleCodeExporter commented 8 years ago
Can you expand on what you mean by "bypassing the built-in PHP function 
http_build_query"?

As far as I'm aware, http_build_query doesn't care about the encoding of the 
data you pass through it; it just percent-encodes every byte other than 
alphanumerics and a few special bytes. So if the query given to the search 
function is UTF-8, it should be fine.
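
A minimal sketch of that behaviour, assuming the source file is saved as UTF-8. The parameter array is hypothetical and only mirrors the query string discussed in this issue; it is not taken from the client's code:

```php
<?php
// Hypothetical parameters mirroring the query string quoted in this issue.
$params = array(
    'wt'      => 'json',
    'json.nl' => 'map',
    'q'       => 'δοκιμή', // assumes this file is saved as UTF-8
    'start'   => 0,
    'rows'    => 10000,
);

// http_build_query percent-encodes the raw bytes of each value;
// it does not inspect or convert the character encoding.
$queryString = http_build_query($params, '', '&');

echo $queryString, "\n";
// wt=json&json.nl=map&q=%CE%B4%CE%BF%CE%BA%CE%B9%CE%BC%CE%AE&start=0&rows=10000
```

Each Greek letter becomes two percent-encoded bytes, which is what a UTF-8 aware server would expect to receive.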

Are you sure the entered data reached the PHP script as UTF-8? You can check it 
with functions like mb_check_encoding.

Additionally, if you're sure things are going through, you can try using the 
POST method on search (it's a parameter). This sends along charset=utf-8 in the 
Content-Type request header. You have to be sure your data is actually UTF-8, 
though.

Original comment by donovan....@gmail.com on 27 May 2012 at 8:34

GoogleCodeExporter commented 8 years ago
By 'bypassing' I mean that I used my own code to build the request query.

To answer your question: the string passed to the search method is UTF-8 
encoded; var_dump(mb_check_encoding($query, 'UTF-8')) reports true.

The URL that is sent to Tomcat is: 
wt=json&json.nl=map&q=%CE%B2%CE%B1%CF%81%CE%B9%CE%AD%CE%BC%CE%B1%CE%B9&start=0&rows=10000

The word δοκιμή (test) is the URL-encoded string 
%CE%B2%CE%B1%CF%81%CE%B9%CE%AD%CE%BC%CE%B1%CE%B9. 

With this query, Solr replies that no documents were found.

BUT with the following, wt=json&json.nl=map&q=δοκιμή&start=0&rows=10000, 
results are returned as expected.

I hope this gives a clearer description of the issue.

Keep up the good work.

Original comment by andreas....@gmail.com on 28 May 2012 at 10:35

GoogleCodeExporter commented 8 years ago
First, I wanted to know whether it was intentional that the URL-encoded example 
decodes to βαριέμαι and not δοκιμή (as your email says it should).

I just ran this to quickly see that:

php -r "echo urldecode('%CE%B2%CE%B1%CF%81%CE%B9%CE%AD%CE%BC%CE%B1%CE%B9');"

If it was just a copy and paste mistake from another test, that's fine, just 
wanted to make sure we aren't trying to compare two different things.
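
The same check can be written as a round trip. This is only a sketch, assuming the file is saved as UTF-8; the variable names are illustrative:

```php
<?php
// Encoded value copied from the URL quoted in the previous comment.
$encoded  = '%CE%B2%CE%B1%CF%81%CE%B9%CE%AD%CE%BC%CE%B1%CE%B9';
// Word the reporter said was searched for (assumes a UTF-8 source file).
$expected = 'δοκιμή';

// Decode the percent-encoded bytes back to a string.
$decoded = urldecode($encoded);

echo $decoded, "\n";               // βαριέμαι
var_dump($decoded === $expected);  // bool(false)

// What the expected word looks like once percent-encoded.
echo urlencode($expected), "\n";   // %CE%B4%CE%BF%CE%BA%CE%B9%CE%BC%CE%AE
```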

Lastly, I finally got around to fully checking this out from an encoding 
perspective and verified what I expected:

 * If I use UTF-8 search queries, but the servlet container for Solr is using the default URI encoding (Latin-1), then I need to submit my query using the POST method so it's interpreted correctly. NOTE: if you use the POST method, you are expected to ALWAYS use UTF-8 data; there is currently no way to specify another encoding for the Content-Type header that's sent.

 * If I use UTF-8 search queries, and my servlet container for Solr is using UTF-8 as the URI encoding (in Tomcat this can be set in server.xml on the Connector element), then everything works fine with both the GET and POST HTTP methods.
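
To illustrate why the first case fails with GET, here is a rough sketch of how the same URL bytes read under each URI encoding. It assumes the mbstring extension is available and only simulates the container's decoding in PHP; it is not how Tomcat or Solr is implemented:

```php
<?php
// Raw bytes a servlet container receives for q=δοκιμή after percent-decoding.
$bytes = urldecode('%CE%B4%CE%BF%CE%BA%CE%B9%CE%BC%CE%AE');

// URI encoding UTF-8: every two bytes form one Greek letter.
$asUtf8 = $bytes;

// URI encoding ISO-8859-1 (Latin-1): every byte is treated as its own
// character. Convert that reading to UTF-8 so the two can be compared.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump(mb_strlen($asUtf8, 'UTF-8'));    // int(6)
var_dump(mb_strlen($asLatin1, 'UTF-8'));  // int(12)
var_dump($asLatin1 === $asUtf8);          // bool(false)
```

Under the Latin-1 reading the query becomes twelve unrelated characters, so it cannot match the Greek terms in the index.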

Hope you found the answer to your issues in the meantime.

Original comment by donovan....@gmail.com on 1 Jun 2012 at 6:23

GoogleCodeExporter commented 8 years ago
Currently works as expected.

Original comment by donovan....@gmail.com on 28 Aug 2012 at 2:25