dannote / mod-ndb

Automatically exported from code.google.com/p/mod-ndb

Improve UTF-8 handling in Coco parsers #75

Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
Today, the JSON parser (for example) does not actually produce UTF-8 characters on output, even if UTF-8 comes in. Either mod_ndb should correctly handle the Unicode wchar_t provided by Coco, or we should extend Coco to provide access back to the original UTF-8 text in the scanner.
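
For the first option, a minimal sketch of what "handling the wchar_t correctly" could look like: re-encoding Coco's wide token value back into UTF-8 rather than truncating each wide character. The helper name `utf8_encode` is hypothetical and not part of Coco or mod_ndb; it assumes each wchar_t holds a full Unicode code point (on platforms with 16-bit wchar_t, surrogate pairs would need extra handling).

```cpp
#include <string>

// Hypothetical helper (not part of Coco or mod_ndb): re-encode a wide
// token value into UTF-8 instead of casting each wchar_t down to char.
// Assumes each wchar_t holds one Unicode code point; on 16-bit wchar_t
// platforms, surrogate pairs would need to be combined first.
std::string utf8_encode(const wchar_t *val) {
  std::string out;
  for (; *val; ++val) {
    unsigned int cp = (unsigned int) *val;
    if (cp < 0x80) {                         // 1-byte sequence (ASCII)
      out += (char) cp;
    } else if (cp < 0x800) {                 // 2-byte sequence
      out += (char) (0xC0 | (cp >> 6));
      out += (char) (0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // 3-byte sequence
      out += (char) (0xE0 | (cp >> 12));
      out += (char) (0x80 | ((cp >> 6) & 0x3F));
      out += (char) (0x80 | (cp & 0x3F));
    } else {                                 // 4-byte sequence
      out += (char) (0xF0 | (cp >> 18));
      out += (char) (0x80 | ((cp >> 12) & 0x3F));
      out += (char) (0x80 | ((cp >> 6) & 0x3F));
      out += (char) (0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```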

Original issue reported on code.google.com by john.david.duncan on 14 May 2009 at 8:23

GoogleCodeExporter commented 9 years ago
I generally want to get tokens out of my parser as char *, not wchar_t *, so I have been using coco_string_create_char(t->val). But this is lossy: from UTF-8 input I will not get UTF-8 output! So I think I'm looking for a way to get the original (char) token from the input stream, rather than the (wchar_t) token used by the parser...

JD
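
To illustrate the loss JD describes, here is a small self-contained example. The conversion loop mirrors what Coco's coco_string_create_char() does (each wchar_t is cast down to a single char), reproduced inline rather than calling Coco directly:

```cpp
#include <cstdio>

// The loop below mirrors coco_string_create_char(): each wchar_t is
// cast to a single char, so any code point above 0x7F is mangled and
// UTF-8 input does not survive the round trip.
int main() {
  const wchar_t *token = L"caf\u00e9";   // what the scanner hands back
  char out[16];
  int i = 0;
  for (; token[i] != L'\0'; ++i)
    out[i] = (char) token[i];            // U+00E9 truncated to byte 0xE9
  out[i] = '\0';
  // 'out' now ends in a bare 0xE9 byte, which is not valid UTF-8;
  // the original two-byte sequence 0xC3 0xA9 is gone.
  printf("%s\n", out);
  return 0;
}
```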

Hi!

That should be possible. The token already stores its start position in the original file, so you can read the original string from there. You can get the length in characters from token->val (or, for better performance, store the length in the token in Scanner::NextToken).

To read the raw string, you could extend the Buffer with a suitable method.
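
A minimal sketch of such a Buffer extension, assuming Coco's C++ framework, where Buffer exposes GetPos()/SetPos() and a virtual Read() that UTF8Buffer overrides to decode multi-byte sequences. The method name GetRawString and its byte-position arguments are assumptions, not part of Coco:

```cpp
// Hypothetical extension to Coco's Buffer class (name and signature
// are assumptions). 'beg' and 'end' are byte positions in the source,
// e.g. token->pos and token->pos plus a byte length stored in
// Scanner::NextToken, as suggested above.
char* Buffer::GetRawString(int beg, int end) {
  char *buf = new char[end - beg + 1];
  int len = 0;
  int oldPos = GetPos();
  SetPos(beg);
  // Call Buffer::Read explicitly (non-virtual dispatch) so that the
  // UTF-8 decoding in UTF8Buffer::Read is bypassed and raw bytes
  // come back unmodified.
  while (GetPos() < end) buf[len++] = (char) Buffer::Read();
  SetPos(oldPos);
  buf[len] = '\0';
  return buf;
}
```

The caller would then get the original UTF-8 bytes of a token directly from the input buffer, sidestepping the wchar_t round trip entirely.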

Original comment by john.david.duncan on 14 May 2009 at 8:24

GoogleCodeExporter commented 9 years ago
See r576

Original comment by john.david.duncan on 6 Jun 2009 at 2:19

GoogleCodeExporter commented 9 years ago

Original comment by john.david.duncan on 10 Jun 2009 at 3:38

GoogleCodeExporter commented 9 years ago

Original comment by john.david.duncan on 25 Jun 2009 at 3:59