allen58992008 / touchcode

Automatically exported from code.google.com/p/touchcode

TouchXML: UTF8 string with non-latin chars appears wrong after parsing #57

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Init CXMLDocument with a UTF-8 string that contains non-Latin characters. I
tried to parse a Russian website this way. The website uses CP1251
encoding, so I converted the NSData holding the HTML page to an NSString using
stringWithData:encoding. The string looked fine (I saw Russian chars as
Russian chars in Xcode).
2. Try to query the CXMLDocument using XPath.

What is the expected output? What do you see instead?

Expected: node text in readable Russian. Instead, the nodes come out with the
Russian chars unrecognizable, e.g. they look like
ÖÑÊÀ âûøåë â ôèíàë.

What version of the product are you using? On what operating system?

The latest as of April 28, 2009, whatever it was.

Please provide any additional information below.

My guess is that this has something to do with converting the NSString to
char* via UTF8String in the CXMLDocument initWithString method. Somewhere
along the way the correct encoding is lost. NSXMLDocument has no such problem:
Russian still looks Russian after being parsed.

Original issue reported on code.google.com by yurypetr...@gmail.com on 1 May 2009 at 7:23
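
The garbled output quoted in the report is consistent with CP1251 bytes being read as Latin-1 somewhere along the way. The Python sketch below is only an illustration of that byte-level effect, not TouchXML code; the helper names `garble` and `repair` are hypothetical, and the Russian phrase is the one obtained by reversing the quoted string.

```python
# Illustration only: how CP1251 bytes turn into Latin-1 mojibake.
# The function names are assumptions for this demo, not part of TouchXML.

def garble(text: str) -> str:
    """Encode as CP1251, then wrongly decode the same bytes as Latin-1."""
    return text.encode("cp1251").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the mistake: recover the raw bytes, then decode them as CP1251."""
    return garbled.encode("latin-1").decode("cp1251")


original = "ЦСКА вышел в финал"
garbled = garble(original)

print(garbled)          # the unreadable form quoted in the report
print(repair(garbled))  # the original Russian text again
```

If this is what happened, no information was lost: the bytes are intact and only the label attached to them is wrong, which is why passing the right encoding to the parser can fix it.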

GoogleCodeExporter commented 8 years ago
Please provide sample code and sample data, see: 
http://code.google.com/p/touchcode/wiki/BugSubmission

Thanks!

Original comment by jwight on 1 May 2009 at 7:51

GoogleCodeExporter commented 8 years ago
Yury, John,

Attached you'll find unit tests that should help to reproduce the problem.

Yury, the tests show that the problem only happens with CXMLDocument's
initWithData:options:error:. For me, initWithXMLString:options:error: is
working fine, contrary to what your report describes.

I went on to check NSXMLDocument. Its initWithData:options:error: doesn't parse
data which isn't proper UTF-8, but it does fine when using
initWithXMLString:options:error:, much (1) like the current CXMLDocument
implementation. I'm not seeing any differences here, so this could be a
won't-fix in order to keep CXMLDocument 1:1 compatible with NSXMLDocument's
API.

One workaround is to convert the NSData to an NSString using the data's actual
encoding and then work from there. Nevertheless, as a proof of concept, I
attach a patch to CXMLDocument that accepts an encoding on its data
initialiser, in order to correctly parse NSData with encodings other than
UTF-8. The patch is backward compatible.

(1) On encoding errors, the current CXMLDocument actually goes on with the
parsing and returns a document, omitting the encoding error, whereas
NSXMLDocument returns a nil document and an error. This is a subject for
another issue, though.

Original comment by jpedroso@gmail.com on 11 May 2009 at 2:04
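
The workaround and the patch described in this comment both come down to telling the parser which encoding the bytes are really in. A rough Python analogue follows, using the standard library's ElementTree instead of TouchXML; `parse_with_encoding` and the sample document are hypothetical and only sketch the idea.

```python
import xml.etree.ElementTree as ET


def parse_with_encoding(data: bytes, encoding: str = "utf-8") -> ET.Element:
    """Decode raw bytes with the caller-supplied encoding, then parse.

    Defaulting to UTF-8 keeps existing callers working, in the same spirit
    as the backward-compatible patch described in the comment.
    """
    text = data.decode(encoding)  # bytes -> Unicode text, using the real encoding
    return ET.fromstring(text)    # the parser now sees correctly decoded text


# Assumed sample data: bytes as a CP1251 site might serve them (no XML declaration).
raw = "<news><title>ЦСКА вышел в финал</title></news>".encode("cp1251")

root = parse_with_encoding(raw, "cp1251")
print(root.findtext("title"))
```

Calling `parse_with_encoding(raw)` without the second argument fails at the decode step in this sketch, because the CP1251 bytes are not valid UTF-8.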

GoogleCodeExporter commented 8 years ago
Jorge, thanks for the unit tests and new method for CXMLDocument.

Hopefully people will find the new method handy. I've accepted the patch and
it is in the repository now.

Closing this bug as fixed. Yury, please try the new API.

Original comment by jwight on 13 May 2009 at 7:13

GoogleCodeExporter commented 8 years ago
Oh, and if you want to become a project committer, Jorge, let me know. Really
happy to add committers who write unit tests :-)

Original comment by jwight on 13 May 2009 at 7:15

GoogleCodeExporter commented 8 years ago
touchJSON has the same issue. Looking through the code, I can't find a place to
make a similar modification, as everything is using NSUTF8StringEncoding.

Original comment by sircambr...@gmail.com on 6 Jun 2009 at 12:13

GoogleCodeExporter commented 8 years ago
Never mind :) I was using it wrong. I was using NSString
stringWithContentsOfURL, then converting to NSData with UTF-8 encoding, then
feeding it to the parser, which... "double decodes" the UTF-8?

Original comment by sircambr...@gmail.com on 6 Jun 2009 at 12:40
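
The mix-up in the last comment reads like a double encoding: text that is already UTF-8 gets treated as a single-byte encoding and is encoded to UTF-8 a second time. That reading is a guess. The Python sketch below only shows what double encoding does to a string and how it can be reversed, with Latin-1 assumed as the intermediate encoding and hypothetical helper names.

```python
# Illustration only: double-encoding UTF-8 text.
# Latin-1 as the middle step is an assumption, as are the function names.

def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as Latin-1, encode to UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def undo_double_encode(data: bytes) -> str:
    """Reverse the two steps to get the original text back."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "финал"
damaged = double_encode(original)

print(len(original.encode("utf-8")), len(damaged))  # byte counts before and after
print(undo_double_encode(damaged) == original)
```

A doubled byte count for non-ASCII text is a quick hint that the data was encoded twice.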