Closed: DJuego closed this issue 6 years ago
It looks to me like the above JSON document is not valid UTF-8. If a JSON document is not valid UTF-8, it's not valid JSON either, so it's correct for sajson to fail to parse it. If it's not valid UTF-8, what do you mean that you don't want to lose the original characters? What are the original characters?
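To make the point about validity concrete: in ISO-8859-1 the character á is the single byte 0xE1, but in UTF-8 a byte of that form announces a three-byte sequence, so when it is followed by a plain ASCII quote the input is malformed. Below is a minimal sketch of a well-formedness check. It is not sajson's own validation code, and `is_valid_utf8` is a hypothetical helper name used only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of sajson): returns true if `s` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong encodings, UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;    // continuation bytes expected after `lead`
        unsigned int cp = 0;      // code point being decoded
        unsigned int min_cp = 0;  // smallest code point allowed at this length
        if (lead <= 0x7f)      { extra = 0; cp = lead;        min_cp = 0; }
        else if (lead < 0xc0)  { return false; }  // stray continuation byte
        else if (lead <= 0xdf) { extra = 1; cp = lead & 0x1f; min_cp = 0x80; }
        else if (lead <= 0xef) { extra = 2; cp = lead & 0x0f; min_cp = 0x800; }
        else if (lead <= 0xf7) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else                   { return false; }  // 0xf8..0xff are never valid
        if (n - i - 1 < extra) return false;      // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xc0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3f);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xd800 && cp <= 0xdfff) return false;  // surrogate range
        if (cp > 0x10ffff) return false;                 // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

With this check, the bytes `{"name":"\xe1"}` are rejected, while the same document with á encoded as `\xc3\xa1` is accepted.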
Thank you for your very swift answer, @chadaustin
> It looks to me like the above JSON document is not valid UTF-8. If a JSON document is not valid UTF-8, it's not valid JSON either, so it's correct for sajson to fail to parse it.
I understand, of course. I am a real noob with encodings and text processing. :-/
I have solved my problem with this code:
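#include <cstring>
#include <string>
#include <vector>
#include "sajson.h"  // sajson single-header library; adjust the path to your setup

// Converts a NUL-terminated ISO-8859-1 string to UTF-8.
// Every ISO-8859-1 byte maps directly to the code point of the same value.
std::string ISO88959ToUTF8(const char *str)
{
    std::string utf8;
    utf8.reserve(2 * std::strlen(str) + 1);
    for (; *str; ++str)
    {
        const unsigned char ch = static_cast<unsigned char>(*str);
        if (!(ch & 0x80))
        {
            // ASCII byte: identical in UTF-8.
            utf8.push_back(static_cast<char>(ch));
        }
        else
        {
            // 0x80..0xFF: two-byte sequence 110000xx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return utf8;
}

// Converts a NUL-terminated UTF-8 string back to ISO-8859-1.
// Code points above U+00FF have no ISO-8859-1 equivalent and are skipped.
std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;
    // Initialized so a leading continuation byte never reads garbage.
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                             // ASCII
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f); // continuation byte
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                      // lead of 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                      // lead of 3-byte sequence
        else
            codepoint = ch & 0x07;                      // lead of 4-byte sequence
        ++in;
        // The code point is complete once the next byte is not a continuation.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}

int main()
{
    // This literal is ISO-8859-1 only if the source file is saved in that
    // encoding; in a UTF-8 source file it would already be UTF-8.
    std::string json_ori = "{\"name\":\"áéíóúñ¡¿\"}";
    std::string json = ISO88959ToUTF8(json_ori.c_str());
    // sajson parses in place, so hand it a mutable copy sized to the data
    // rather than a fixed char[32] filled with strcpy.
    std::vector<char> m_json(json.begin(), json.end());
    const sajson::document& document = sajson::parse(sajson::dynamic_allocation(),
        sajson::mutable_string_view(m_json.size(), m_json.data()));
    if (!document.is_valid())
    {
        return -1;
    }
    std::string name = UTF8toISO8859_1(document.get_root().get_object_key(0).as_string().c_str());
    std::string value = UTF8toISO8859_1(document.get_root().get_object_value(0).as_cstring());
    return 0;
}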
I do not know whether it can be improved, but at least it does exactly what I wanted. I would welcome any suggestions or alternative approaches. :-)
DJuego
If you know your input data is ISO-8859-1, this approach seems fine to me!
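For anyone who wants to sanity-check this kind of conversion without sajson in the loop, here is a small round-trip sketch. It rests on the same assumption, that the input really is ISO-8859-1. The names `latin1_to_utf8` and `utf8_to_latin1` are illustrative and come from neither the code above nor sajson. Unlike the posted functions, this variant takes `std::string` (so embedded NUL bytes survive) and reports failure instead of silently skipping characters that do not fit in ISO-8859-1.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper: ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF,
// so each byte becomes either itself (ASCII) or a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (char raw : in) {
        const unsigned char c = static_cast<unsigned char>(raw);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xc0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3f)));
        }
    }
    return out;
}

// Illustrative inverse: returns false (leaving `out` partially filled) if the
// input is malformed or contains a code point above U+00FF.
bool utf8_to_latin1(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
            continue;
        }
        // Only the lead bytes 0xc2 and 0xc3 encode U+0080..U+00FF.
        if ((c != 0xc2 && c != 0xc3) || i + 1 >= in.size()) return false;
        const unsigned char next = static_cast<unsigned char>(in[i + 1]);
        if ((next & 0xc0) != 0x80) return false;
        out.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3f)));
        ++i;  // skip the continuation byte just consumed
    }
    return true;
}
```

Feeding the ISO-8859-1 bytes for áéíóúñ¡¿ through `latin1_to_utf8` and back through `utf8_to_latin1` should return the original bytes, while a character such as € (U+20AC) makes `utf8_to_latin1` return false.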
Hi! I am a Spanish newbie with string encoding.
I tried:
I got: ERROR_INVALID_UTF8 (22)
How can I solve this problem? I have tested utilities like http://utfcpp.sourceforge.net/ for ensuring that a string contains valid UTF-8, and then it works! But... I don't want to lose the original characters (!)
Thanks!
DJuego