emka / openstreetbugs

OpenStreetBugs (deprecated)
http://wiki.openstreetmap.org/wiki/OpenStreetBugs
GNU General Public License v3.0

getRSSfeed broken for places containing non-ascii characters #29

Open MatthiasNieuwenhuisen opened 13 years ago

MatthiasNieuwenhuisen commented 13 years ago

Non-valid RSS feeds are created in some regions (e.g. http://openstreetbugs.schokokeks.org/api/0.1/getRSSfeed?b=50.62895&t=50.78353&l=6.89193&r=7.30323 ). Here, the feed stops after 7 bugs, leaving some XML tags open.

I tracked down the problem to the c[6].encode("utf-8") in line 80 of getRSSfeed, which raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The bug that is not displayed here is near "Lüftelberg", which is the only place with an umlaut in the SQL result. For me it works if I remove the call to encode and output the raw content of c[6] instead.

Flachzange commented 12 years ago

I am not sure if my problem is related to this issue, but it is very hard to find valid RSS links that my tested RSS clients can read. I tested it with Mozilla Thunderbird, a Miranda IM plugin, on my HTC phone and with Google. Except for Google, all of the clients give me an error message for most of the RSS feed links in my region (Kassel, Germany). For example:

http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fopenstreetbugs.schokokeks.org%2Fapi%2F0.1%2FgetRSSfeed%3Fb%3D51.26201%26t%3D51.27278%26l%3D9.43468%26r%3D9.45648

I hope this can be fixed as the rss feed feature is really valuable.

Edit: Okay, I just checked whether the selected region contains non-ASCII characters, and indeed it does.

nattomi commented 12 years ago

Hugoe, my experience was exactly the opposite: the whole thing only worked if I used c[6].encode("utf-8") instead of the raw c[6], see https://github.com/emka/openstreetbugs/issues/27. I had problems with place names containing characters such as ő, ű, etc. I'm not experienced in Python at all, but that modification helped solve my issue -- that's why I suggested it.

emka commented 12 years ago

I removed the "(near %s)" (on the server, not in git) as a temporary fix. If somebody wants to dig deeper into the charset issues, go ahead.

KurtKrampmeier commented 12 years ago

Great, this brought my feed back to life. :) I checked the code and it looks like you are simply trying to convert the string to UTF-8 twice: once in addPOIexec before it is inserted into the database, and then again in getRSSfeed after getting the string from the database. While I do not know much about Python, using encode("utf-8") twice without using decode("utf-8") in between is likely wrong. Reverting getRSSfeed to the version before commit 9c3825490d8bc1fff2ce61f82099ced3295be9da should get it working again with the name in the title, as long as the string was inserted correctly into the database. Adding new OpenStreetBugs containing non-ASCII characters will still work, unless the changes in addPOIexec are also reverted.
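
A minimal sketch of why that would fail (assuming Python 2, which the scripts appear to use; the place name is just an example): calling encode("utf-8") on a byte string that is already UTF-8 makes Python implicitly decode it with the default 'ascii' codec first, which fails with exactly the error quoted above:

    # Python 2: encode() on an already-encoded byte string triggers an
    # implicit ascii decode and fails on the first non-ASCII byte.
    name = u'L\xfcftelberg'            # unicode object, e.g. parsed from the GeoNames XML
    stored = name.encode('utf-8')      # what addPOIexec writes: 'L\xc3\xbcftelberg'
    try:
        stored.encode('utf-8')         # what getRSSfeed then does with the DB value
    except UnicodeDecodeError as e:
        print e                        # 'ascii' codec can't decode byte 0xc3 in position 1: ...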

nattomi commented 12 years ago

@mibe, I tried your suggestion on my local install of the OSB CMS, but it didn't help either. My installation only works with the encode("utf-8") approach. If I don't use name.encode("utf-8") in addPOIexec, some bugs with exotic characters are not inserted into the database, even if I specify charset="utf8" when establishing the MySQL connection. Also, I need to use c[6].encode("utf-8"), otherwise the RSS feed doesn't display any items.

mibe commented 12 years ago

Do you have UTF-8 or ISO 8859-* data in your database? Because this is still the problem on the main OSB installation. The content in this database is a mixture of UTF-8 and ISO 8859-1, which means this fix works for UTF-8, but not for ISO 8859-1 data. There was also some discussion about the mixture on the German OSM mailing list.

nattomi commented 12 years ago

@mibe, the character set of the 'bugs' table is set to utf8. I haven't specified any individual character set for any column, so I suppose all my columns inherit utf8 from the overall table setting I mentioned above. Quotation from dev.mysql.com: "There are default settings for character sets and collations at four levels: server, database, table, and column". I have no idea what the situation is on my side at the database and server levels. My database is filled with data such as "árvíztűrő tükörfúrógép", which is ISO 8859-2.

mibe commented 12 years ago

@nattomi, do you have utf8 as the default server character set, or did you set that manually? Because create-database.sql is missing collations, which means the server's default collations would be used. My database, table and columns have the latin1_swedish_ci collation (the default on my server), which is not perfect but irrelevant for the data inside the table. But where does that ISO 8859-2 data come from? AFAIK geonames.org returns UTF-8 formatted data...

nattomi commented 12 years ago

@mibe, I set this manually using a GUI. I don't know anything about the formatting of the geonames XML data, and I also don't have any idea how we could figure it out. Here is an example findNearbyPlaceName query which messes things up for me: http://ws.geonames.org/findNearbyPlaceName?lat=47.635&lng=16.7. The content of the name element here is 'Fertőboz', and the character 'ő' is the out-of-range one. When I said ISO 8859-2 in my last comment, I didn't mean that it is the actual encoding, I only meant that for such strings this is the "official" encoding (also known as Central European in this case). I wish I knew how to check the actual encoding.

KurtKrampmeier commented 12 years ago

There is no such thing as an "official" encoding for a certain character. While the character ő can be expressed using the ISO-8859-2 encoding (as the byte value 0xF5), it can also be expressed using a Unicode encoding like UTF-8 (requiring two bytes: 0xC5 followed by 0x91). Unicode has the great advantage of covering virtually all characters worldwide.
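
A small Python 2 sketch of that point, for illustration (the character is written as its Unicode escape so the snippet runs regardless of the source file encoding):

    # The same character, two different byte representations.
    ch = u'\u0151'                         # LATIN SMALL LETTER O WITH DOUBLE ACUTE
    print repr(ch.encode('iso-8859-2'))    # '\xf5'      -> one byte in ISO-8859-2
    print repr(ch.encode('utf-8'))         # '\xc5\x91'  -> two bytes in UTF-8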

Thus UTF-8 is widely used, when characters from different languages have to be usable at the same time. GeoNames is also using UTF-8 in the XML responses. How do I know this? Quite easy: UTF-8 is the default encoding for XML files, but it is also explicitly stated in the XML prologue at the beginning of the file (<?xml version="1.0" encoding="UTF-8" standalone="no"?>) and also in the HTTP response header (Content-Type:text/xml;charset=UTF-8). You can also use a network analyzer like Wireshark to view the response as binary data. You will see the character is encoded as the bytes C5 91.

So far, I know what I am talking about. When it comes to Python, I can only guess, but you might want to check if the strings are written into the database with the correct (UTF-8) encoding, e.g. by printing the query as a hexdump (and verifying that the database connection expects a UTF-8 string). This way you should be able to narrow down the problem to either creating new bugs or reading from the database to build the RSS file. Since everything in this chain should be using UTF-8, no conversion/encoding should be necessary in any step. Old database entries with a broken/different encoding could simply be deleted for a quick fix. However, I have to admit that I did not fully understand http://docs.python.org/howto/unicode, as I don't know which data type is used in which place (e.g. when reading from the file and from the database). So there still might be some place where you need to do some conversion.
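
A quick way to do that check, sketched in Python 2 (the helper and the sample value are placeholders, not code from the repository):

    def hexdump(s):
        # Return the bytes of a str as space-separated hex values.
        return ' '.join('%02x' % ord(b) for b in s)

    name = u'Fert\u0151boz'.encode('utf-8')   # stand-in for the value sent to MySQL
    print hexdump(name)
    # 46 65 72 74 c5 91 62 6f 7a  -> 'c5 91' means the bytes really are UTF-8;
    # an 'f5' at that position would point to ISO-8859-2 instead.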

nattomi commented 12 years ago

Thank you @KurtKrampmeier for enlightening me, this short encoding guide is certainly useful for us. I knew well that there is no "mandatory" encoding, I guess I just expressed myself in a confusing way.

In the meantime I created a simplified version of the addPOIexec script, which I made available at http://storage.ggki.hu/~nattomi/python-geonames/. You can specify lat and lon at the beginning; it then queries the geonames service and uses minidom to read the content of the "name" and "country" tags. I store the result of the urllib2.urlopen() function in the variable gnData (so the content in gnData is free of any kind of minidom parsing). Then comes the minidom part, which reads out "name" the wrong way. At the end of the script I print out the variables gnData and values. This is what I get:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    Fertőboz Fertőboz 47.63638 16.70085 3052647 HU Hungary P PPL 0.16601

    {'lat': 47.635, 'lon': 16.7, 'nearbyplace': u'Fert\u0151boz [HU]'}

So the content in gnData is "Fertőboz", but it is "Fert\u0151boz" in the values array. Therefore the source of the buggy behaviour is the minidom part:

    gnData = response.read()
    dom = minidom.parseString(gnData)
    if dom.getElementsByTagName('name'):
        name = dom.getElementsByTagName('name')[0].firstChild.data
        country = dom.getElementsByTagName('countryCode')[0].firstChild.data

IMO this is what we need to fix.
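
For reference, a rough sketch of what such a simplified script could look like (assuming Python 2 with urllib2 and minidom, presumably the same way addPOIexec does it; the exact variable names may differ from the linked file):

    import urllib2
    from xml.dom import minidom

    lat, lon = 47.635, 16.7
    url = 'http://ws.geonames.org/findNearbyPlaceName?lat=%s&lng=%s' % (lat, lon)

    response = urllib2.urlopen(url)
    gnData = response.read()                  # raw bytes, UTF-8 per the XML prologue

    values = {'lat': lat, 'lon': lon}
    dom = minidom.parseString(gnData)
    if dom.getElementsByTagName('name'):
        name = dom.getElementsByTagName('name')[0].firstChild.data           # unicode object
        country = dom.getElementsByTagName('countryCode')[0].firstChild.data
        values['nearbyplace'] = u'%s [%s]' % (name, country)

    print gnData
    print values                              # repr() of the dict shows u'Fert\u0151boz'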

nattomi commented 12 years ago

UPDATE: However, if I also print the name variable, I get the right content, i.e. "Fertőboz". It seems like the content of the 'name' variable is properly encoded as a standalone variable, but its encoding goes wrong as soon as it gets added to the 'values' array.

mibe commented 12 years ago

@nattomi: I'm not a Python guru, but isn't the "u'Fert\u0151boz'" intended behaviour when printing dictionaries with unicode strings? Try this script here: https://gist.github.com/2363259 You'll see that the dictionary has the \x and \u chars, but the difference is that the first is using the "str" class, while the other one is using the "unicode" class. When printing these strings, the difference is that the instance of the str class AFAIK doesn't care whether the console understands unicode chars (that's why the name is garbled in the Windows output), while the other one does, which results in an error.

The values dictionary isn't the problem here. Try to print values["nearbyplace"] in addition to name, and you'll see it has the same content on the console, as long as your console supports unicode chars (e.g. my Windows box doesn't, it's using cp850: I'm getting a UnicodeEncodeError. On an SSH session to a Debian box, which has UTF-8 support, Python is not complaining.).
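
A minimal Python 2 sketch of that difference (not the linked gist, just an illustration): printing the dict shows repr() of its values, while printing the string itself encodes it for the console:

    values = {'nearbyplace': u'Fert\u0151boz [HU]'}
    print values                    # {'nearbyplace': u'Fert\u0151boz [HU]'} <- repr, escaped
    print values['nearbyplace']     # the readable place name on a UTF-8 console,
                                    # UnicodeEncodeError on e.g. a cp850 Windows console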

nattomi commented 12 years ago

@mibe: I'm not a Python guru myself (this is the first time I've touched Python code), nor a character encoding guru, but this bug bugs the hell out of me and somehow I have the impression that it shouldn't be that complicated to resolve. I feel like we are heading somewhere now and are only a few iterations away from getting there. I tried to print values["nearbyplace"] in my simplified script and the result is shown here: https://gist.github.com/2367095. I accept that "u'Fert\u0151boz'" is the intended behaviour; however, I'm still a bit confused why it is printed differently in values and in values["nearbyplace"] -- this is probably due to my lack of proper Python knowledge.

nattomi commented 12 years ago

Ok, my installation works well now. The important thing was to specify utf8 in the database for the 'text' and 'nearby_place' columns. I'm not sure that it's needed, but I also set the charset at the table level to utf8. Moreover, you must consistently use the charset = "utf8", use_unicode = True arguments of MySQLdb.connect in all .py files. The name.encode('utf-8') idea was wrong, sorry for the inconvenience I might have caused with that. See issues #38, #39, #40, #41, #42, #43.
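
A minimal sketch of such a connect call (host, user, password and database name below are placeholders, not the real configuration):

    import MySQLdb

    db = MySQLdb.connect(host='localhost', user='osb', passwd='secret',
                         db='openstreetbugs',
                         charset='utf8', use_unicode=True)
    # charset='utf8' makes the connection transfer UTF-8, and use_unicode=True
    # makes MySQLdb return unicode objects instead of byte strings, so no manual
    # encode()/decode() calls should be needed in addPOIexec or getRSSfeed.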