adulau / Forban

Forban is a p2p application for link-local and local area networks. Forban works independently from the Internet and uses only the local area capabilities to announce, discover, search or share files. Forban relies on HTTP and it is "opportunistic".
http://www.foo.be/forban/

German Umlauts break search #18

Open MaStr opened 11 years ago

MaStr commented 11 years ago

Hi, on a remote system I have a file with an "ö" in its name. This breaks the search on every connected Forban.

Browsing works.

Error in forban_share_error.log:

--- will be included soon ---

Any idea?

Matthias

MaStr commented 11 years ago

[21/Nov/2012:08:02:55] HTTP Traceback (most recent call last):
  File "/opt/forban/lib/ext/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/opt/forban/lib/ext/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/opt/forban/lib/ext/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/opt/forban/bin/forban_share.py", line 246, in q
    html += """%s %s """ % (foundfiles[0].rsplit(",",1)[0],forban_geturl(uuid=foundfiles[1],filename=filename),discoveredloot.getname(foundfiles[1]))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

[21/Nov/2012:08:02:55] HTTP Request Headers:
  REFERER: http://piratebox.lan:12555/
  HOST: piratebox.lan:12555
  CONNECTION: keep-alive
  CACHE-CONTROL: max-age=0
  Remote-Addr: ::ffff:192.168.1.168
  ACCEPT: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
  ACCEPT-CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3
  USER-AGENT: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
  ACCEPT-LANGUAGE: en-US,en;q=0.8
  ACCEPT-ENCODING: gzip,deflate,sdch

[The identical traceback and request headers were logged again at 21/Nov/2012:08:03:01, 08:03:08 and 08:03:45.]

MaStr commented 11 years ago

By the way, in the browse list the umlaut shows up as a UTF-8 character (two bytes).

adulau commented 11 years ago

Hi Matthias,

I did a small test writing a file named "ö.txt" in the share directory.

http://127.0.0.1:12555/q/?v=%C3%B6

I didn't get the same exception. Could you start a Python interpreter on the server and check the default encoding?

import sys
print sys.getdefaultencoding()

Just to be sure.

MaStr commented 11 years ago

root@rPt4WCYo:/# python
Python 2.7.3 (default, Nov 3 2012, 11:37:47)
[GCC 4.6.3 20120201 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.getdefaultencoding()
ascii

adulau commented 11 years ago

I found the issue(s), but I'm currently struggling to fix it properly. The incoming filename is encoded in UTF-8, but the default codec in Python (for forban_share, lines 243 to 250) is usually ascii, and the filename is then passed back through Python's base64 encoding library, where UTF-8 is not appreciated...

I tested with some ".decode("utf-8").encode("latin-1")", but it doesn't work consistently across Python versions, and it depends especially on the site configuration of the encoding. If you have any ideas, let me know. I'll check some other ideas.
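The failure described above can be reproduced in isolation. The following is only a minimal sketch, not Forban's code: Python 2 performs the ascii decode implicitly when a byte string and a unicode string are combined, while this version triggers the same decode explicitly so that it also runs on Python 3. The helper `decode_with` and the sample filename are assumptions made for illustration.

```python
# Hypothetical sample: the UTF-8 bytes of the filename "ö.txt" (0xc3 0xb6 + ".txt").
raw_name = u"\u00f6.txt".encode("utf-8")


def decode_with(codec, data):
    """Illustrative helper: return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding with the 'ascii' codec fails on the first non-ASCII byte (0xc3).
text, error = decode_with("ascii", raw_name)
print(error)

# Decoding with the codec the bytes were actually written in succeeds.
text_utf8, error_utf8 = decode_with("utf-8", raw_name)
print(repr(text_utf8))
```

The point of the sketch is that the bytes themselves are fine; what breaks is the codec chosen (or defaulted to) at the moment bytes are turned into text.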

MaStr commented 11 years ago

Is it possible to redefine the default encoding around the base64 decoding and turn it back to ascii afterwards, e.g. with import sys; sys.setdefaultencoding('utf-8')? Or what about reducing every filename to ascii everywhere (while hashing, searching and so on)?

I learned a few things about all this sh** encoding stuff while doing system administration and user support for IBM WebSphere MQ: you have to know which encoding enters your system and which one you use inside (e.g. during modification). I think one problem may be a filename on disk that is not encoded in UTF but contains special characters in, e.g., ISO...-15.

The complete platform-independent steps should be something like this:

  1. Get Filename
  2. Convert Filename to UTF (if it already is, this shouldn't change anything)
  3. encode to base64
  4. decode to string in UTF (assuming you can accept UTF encoding while decoding)
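The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not Forban's implementation: the function names are hypothetical, the ISO-8859-15 fallback is only one possible guess for filenames that are not valid UTF-8, and the code is written so that it runs on both Python 2.7 and Python 3.

```python
import base64


def to_unicode(raw_name, fallback="iso-8859-15"):
    """Steps 1-2: turn filename bytes into text, trying UTF-8 first."""
    if not isinstance(raw_name, bytes):
        return raw_name  # already text, nothing to convert
    try:
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 names come from a legacy single-byte locale.
        return raw_name.decode(fallback)


def encode_name(raw_name):
    """Step 3: base64 over the UTF-8 bytes; the result is plain ASCII text."""
    utf8_bytes = to_unicode(raw_name).encode("utf-8")
    return base64.b64encode(utf8_bytes).decode("ascii")


def decode_name(token):
    """Step 4: base64 back to bytes, then decode those bytes explicitly as UTF-8."""
    return base64.b64decode(token.encode("ascii")).decode("utf-8")


token = encode_name(b"\xc3\xb6.txt")       # UTF-8 bytes of "ö.txt"
legacy = encode_name(b"\xf6.txt")          # the same name in a single-byte encoding
print(token == legacy, decode_name(token) == u"\u00f6.txt")
```

Because every decode names its codec explicitly, the result does not depend on `sys.getdefaultencoding()`.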

If the normal base64.decode can't handle this well, you may try this library for encode and decode: http://docs.python.org/2.7/library/binascii.html?highlight=binascii#binascii

From a quick look, it offers a "convert any byte array to hex" functionality. This should work like the default base64 function, with the drawback that you have to convert back to a string again.
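A hex round trip with binascii might look like the sketch below. The function names are hypothetical, and the explicit UTF-8 encode and decode around `hexlify` and `unhexlify` are an assumption about how the filename would be handled; binascii itself only sees bytes.

```python
import binascii


def name_to_hex(name):
    """Encode a text filename as an ASCII hex string of its UTF-8 bytes."""
    return binascii.hexlify(name.encode("utf-8")).decode("ascii")


def hex_to_name(token):
    """Reverse of name_to_hex: hex string -> UTF-8 bytes -> text."""
    return binascii.unhexlify(token.encode("ascii")).decode("utf-8")


hex_token = name_to_hex(u"\u00f6.txt")
print(hex_token)  # c3b62e747874
print(hex_to_name(hex_token) == u"\u00f6.txt")
```

The hex form is twice as long as the raw bytes, but it contains only the characters 0-9 and a-f, so it survives any ascii-only code path.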

adulau commented 11 years ago

Thanks for the feedback.

Step 4 is exactly where the issue is. Python's base64 module also relies on the binascii module. I'll give it another try.

MaStr commented 11 years ago

Hi, I just found out that this issue also breaks the "remote download" functionality. When you visit Forban on your box, click "browse" in the line of another Forban and then "get", you receive a 404 error saying that /s/ is not available.

:( Matthias

toebbel commented 11 years ago

Try this: Add the following lines to your app config.

tools.decode.on = True
tools.encode.on = True
tools.encode.encoding = "utf-8"
tools.decode.encoding = "utf-8"

via http://stackoverflow.com/a/4915497/359326
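Expressed as a Python dict, the settings above might look like this. It is only a sketch: the "/" path key and the way Forban hands its configuration to CherryPy are assumptions, and nothing here imports CherryPy.

```python
# The four tool settings from the lines above, as a CherryPy-style config dict.
utf8_tools = {
    "tools.decode.on": True,
    "tools.decode.encoding": "utf-8",
    "tools.encode.on": True,
    "tools.encode.encoding": "utf-8",
}

# Assumption: the settings apply to the whole application, mounted at "/".
app_config = {"/": utf8_tools}

# With CherryPy installed, a dict like this is typically passed when mounting, e.g.
#   cherrypy.tree.mount(root, "/", config=app_config)   # root: hypothetical handler object
print(sorted(app_config["/"]))
```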

adulau commented 11 years ago

Yep, I tried that some time ago, but the result varies depending on the Python 2 version and the platform it's running on. I'll build a set of test cases to find the origin of the issue. Thank you.