chadaustin / sajson

Lightweight, extremely high-performance JSON parser for C++11
MIT License

Question about invalid UTF-8 #45

Closed DJuego closed 6 years ago

DJuego commented 6 years ago

Hi! I am a Spanish newbie when it comes to string encoding.

I tried:

#include <cstring>
#include <string>
#include "sajson.h"

int main()
{
    // The bytes of this literal depend on the encoding of the source file.
    std::string json = "{\"name\":\"áéíóúñ¡¿\"}";
    size_t longitud = json.length();

    char m_json[32];
    strcpy(m_json, json.c_str());

    const sajson::document& document = sajson::parse(sajson::dynamic_allocation(),
        sajson::mutable_string_view(longitud, m_json));
    if (!document.is_valid())
    {
        return -1;
    }
    return 0;
}

I get: ERROR_INVALID_UTF8 (22)

How can I solve this problem? I have tested utilities like http://utfcpp.sourceforge.net/ for ensuring that a string contains valid UTF-8, and then it works! But I don't want to lose the original characters (!)

Thanks!

DJuego

chadaustin commented 6 years ago

It looks to me like the above JSON document is not valid UTF-8. If a JSON document is not valid UTF-8, it's not valid JSON either, so it's correct for sajson to fail to parse it. If it's not valid UTF-8, what do you mean that you don't want to lose the original characters? What are the original characters?
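
To make the point above concrete, here is a minimal sketch of a UTF-8 well-formedness check. The function name `is_valid_utf8` is hypothetical and is not part of sajson; it only illustrates why ISO-8859-1 bytes such as `0xE1` (á) are rejected, while the two-byte sequence `0xC3 0xA1` for the same character is accepted. The strings use explicit byte escapes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of sajson): returns true if `s` is
// well-formed UTF-8, rejecting truncated sequences, overlong forms,
// surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;    // continuation bytes expected after b0
        unsigned int cp;      // code point being assembled
        unsigned int min_cp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += extra + 1;
    }
    return true;
}
```

With this check, `is_valid_utf8("{\"name\":\"\xE1\"}")` fails because `0xE1` announces a three-byte sequence that never arrives, whereas `is_valid_utf8("{\"name\":\"\xC3\xA1\"}")` succeeds.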

DJuego commented 6 years ago

Thank you for your very swift answer, @chadaustin

> It looks to me like the above JSON document is not valid UTF-8. If a JSON document is not valid UTF-8, it's not valid JSON either, so it's correct for sajson to fail to parse it.

I understand, of course. I am a real noob with encodings and text processing. :-/

I have solved my problem with this code:


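#include <cstring>
#include <string>
#include "sajson.h"

// Converts a NUL-terminated ISO-8859-1 string to UTF-8.
std::string ISO88959ToUTF8(const char *str)
{
    std::string utf8;
    utf8.reserve(2 * strlen(str) + 1);

    for (; *str; ++str)
    {
        const unsigned char ch = static_cast<unsigned char>(*str);
        if (ch < 0x80)
        {
            // ASCII bytes are identical in both encodings.
            utf8.push_back(static_cast<char>(ch));
        }
        else
        {
            // U+0080..U+00FF: lead byte 110000xx, continuation byte 10xxxxxx.
            utf8.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return utf8;
}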

std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    // Initialized so that a stray leading continuation byte cannot read an indeterminate value.
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}

int main()
{
    std::string json_ori = "{\"name\":\"áéíóúñ¡¿\"}";
    std::string json =  ISO88959ToUTF8(json_ori.c_str());

    size_t longitud = json.length();

    char m_json[32];
    strcpy(m_json, json.c_str());

    const sajson::document& document = sajson::parse(sajson::dynamic_allocation(), 
        sajson::mutable_string_view(longitud, m_json));
    if (!document.is_valid())
    {
        return -1;
    }

    std::string name = UTF8toISO8859_1(document.get_root().get_object_key(0).as_string().c_str());
    std::string value = UTF8toISO8859_1(document.get_root().get_object_value(0).as_cstring());

    return 0;
}

I do not know if it is possible to improve it but at least it does exactly what I wanted. I would love any suggestion or alternative approach. :-)

DJuego

chadaustin commented 6 years ago

If you know your input data is ISO-8859-1, this approach seems fine to me!
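
As one possible alternative, the same round trip can be sketched with `std::string` in and out, so the length travels with the data instead of relying on NUL termination. The names `latin1_to_utf8` and `utf8_to_latin1` and the `fallback` parameter are hypothetical and not part of sajson. The decoder assumes its input is already valid UTF-8, for example because it came out of a parser that validated it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of sajson). Every ISO-8859-1 byte equals the
// Unicode code point of the same value, so bytes >= 0x80 become two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper (not part of sajson). Assumes `in` is valid UTF-8.
// Code points above U+00FF have no ISO-8859-1 equivalent and are replaced
// with `fallback`.
std::string utf8_to_latin1(const std::string& in, char fallback = '?') {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out.push_back(static_cast<char>(b0));  // ASCII passes through
            i += 1;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
            i += 2;
        } else {
            // Skip the lead byte and any continuation bytes that follow it.
            ++i;
            while (i < in.size() &&
                   (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
                ++i;
            }
            out.push_back(fallback);
        }
    }
    return out;
}
```

Sizing the parse buffer from the converted string's `size()` (for instance with a `std::vector<char>`) would also remove the dependence on a fixed `char m_json[32]`, since the UTF-8 form can be up to twice as long as the ISO-8859-1 input.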