avinash-k / gwtwiki

Automatically exported from code.google.com/p/gwtwiki

Does not recognize image tags with non-ascii characters #141

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Parsing fails for an image tag whose namespace contains non-ASCII characters.

Example from Romanian Wikipedia.
"[[Fişier:Chisinau Center.jpg|thumb|right|220px|Arcul de Triumf. În spatele 
său se află [[Catedrala Mitropolitană din Chişinău|Catedrala 
Mitropolitană]]]]"

Accompanying properties file

wiki.api.category1      = Categorie
wiki.api.categorytalk1  = Discuție_Categorie
wiki.api.help1          = Ajutor
wiki.api.helptalk1      = Discuție_Ajutor
wiki.api.image1         = Fișier
wiki.api.imagetalk1     = Discuție_Fișier
wiki.api.media1         = Media
wiki.api.mediawiki1     = MediaWiki
wiki.api.mediawikitalk1 = Discuție_MediaWiki
wiki.api.special1       = Special
wiki.api.talk1          = Discuție
wiki.api.template1      = Format
wiki.api.templatetalk1  = Discuție_Format
wiki.api.user1          = Utilizator
wiki.api.usertalk1      = Discuție_Utilizator
wiki.api.categorytalk2  = Discuţie_Categorie
wiki.api.helptalk2      = Discuţie_Ajutor
wiki.api.image2         = Fişier
wiki.api.imagetalk2     = Discuţie_Fişier
wiki.api.mediawikitalk2 = Discuţie_MediaWiki
wiki.api.talk2          = Discuţie
wiki.api.templatetalk2  = Discuţie_Format
wiki.api.usertalk2      = Discuţie_Utilizator

The result retrieved is
Catedrala Mitropolitană]]

whereas the plain text converter should return an empty string.

If you replace Fişier with Fisier, the issue disappears, which suggests that 
the matching rules do not handle non-ASCII characters.

Original issue reported on code.google.com by rmyeid on 20 May 2013 at 10:29

GoogleCodeExporter commented 8 years ago
From the Properties Javadoc you can see that you have to use Unicode 
escape sequences for special characters.

See:
http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html

"The load(Reader) / store(Writer, String) methods load and store properties 
from and to a character based stream in a simple line-oriented format specified 
below. The load(InputStream) / store(OutputStream, String) methods work the 
same way as the load(Reader)/store(Writer, String) pair, except the 
input/output stream is encoded in ISO 8859-1 character encoding. Characters 
that cannot be directly represented in this encoding can be written using 
Unicode escapes; only a single 'u' character is allowed in an escape sequence. 
The native2ascii tool can be used to convert property files to and from other 
character encodings."

See
http://docs.oracle.com/javase/6/docs/technotes/tools/windows/native2ascii.html
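
To illustrate the Javadoc passage above, here is a minimal sketch of how the same entry behaves under the two load variants. The class name PropertiesEncodingDemo and its helper methods are hypothetical and not part of gwtwiki; how the library itself opens its properties files is not shown in this thread. The sketch assumes Java 8 or later (StandardCharsets, UncheckedIOException), while load(Reader) itself exists since Java 6.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Hypothetical helper: load(InputStream) always decodes the bytes as ISO 8859-1.
    static String loadViaStream(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Hypothetical helper: load(Reader) uses whatever charset the Reader was built with.
    static String loadViaUtf8Reader(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String key = "wiki.api.image2";
        // "Fi\u015Fier" is the namespace from the report (s with cedilla, U+015F).
        String expected = "Fi\u015Fier";

        // File content saved as UTF-8 with the raw character in it.
        byte[] rawUtf8 = (key + " = " + expected + "\n").getBytes(StandardCharsets.UTF_8);
        // File content with the character written as a literal backslash-u escape.
        byte[] escaped = (key + " = Fi\\u015Fier\n").getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("raw UTF-8 via InputStream matches: "
                + expected.equals(loadViaStream(rawUtf8, key)));
        System.out.println("escaped via InputStream matches:   "
                + expected.equals(loadViaStream(escaped, key)));
        System.out.println("raw UTF-8 via UTF-8 Reader matches: "
                + expected.equals(loadViaUtf8Reader(rawUtf8, key)));
    }
}
```

With load(InputStream), the two UTF-8 bytes of the raw character are read as two separate ISO 8859-1 characters, so the configured namespace no longer equals the one in the wikitext. Escaping the entry (for example with native2ascii) or loading through a Reader with an explicit charset avoids that mismatch.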

Original comment by axelclk@gmail.com on 25 May 2013 at 3:14

GoogleCodeExporter commented 8 years ago
It would be nice if there were a warning when Unicode characters are not 
escaped. Nowadays I would expect Unicode to work as is. Not to mention that 
the properties files are no longer readable once escaped.
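
A warning of the kind requested here could be approximated outside the library. The following is only a sketch under stated assumptions: the class PropertiesAsciiCheck and the method findUnescaped are hypothetical names, nothing like this is known to exist in gwtwiki, and the caller is assumed to have already decoded the file with its real encoding (for example UTF-8). It flags every character above U+007F, which is stricter than ISO 8859-1 requires but matches what native2ascii would escape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PropertiesAsciiCheck {

    // Hypothetical check: returns one warning per non-ASCII character found.
    static List<String> findUnescaped(String content) {
        List<String> warnings = new ArrayList<>();
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            for (int j = 0; j < line.length(); j++) {
                char c = line.charAt(j);
                if (c > 0x7F) {
                    // Report the position and the code point that should be escaped.
                    warnings.add(String.format(Locale.ROOT,
                            "line %d, column %d: unescaped U+%04X",
                            i + 1, j + 1, (int) c));
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // \u0219 is s with comma below, as in the image1 entry of the report.
        String content = "wiki.api.image1 = Fi\u0219ier\nwiki.api.media1 = Media\n";
        for (String warning : findUnescaped(content)) {
            System.out.println(warning);
        }
    }
}
```

Running such a check before loading would at least surface the problem early instead of letting the namespace silently fail to match.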

Original comment by rmyeid on 25 May 2013 at 3:44