beached / daw_json_link

Fast, convenient JSON serialization and parsing in C++
https://beached.github.io/daw_json_link/
Boost Software License 1.0
474 stars · 31 forks

Problem serializing some strings #389

Closed — juceda closed this issue 1 year ago

juceda commented 1 year ago

Recently I have had a serious problem using the daw_json_link library. I handle strings that can contain arbitrary bytes, not just 7-bit printable characters.

I have observed that, depending on the sequence of bytes inside the string, the to_json function hangs: the program begins to consume a lot of memory until the Linux kernel finally kills it due to lack of resources (OOM).

I've made a quick example to reproduce it:

#include <iostream>
#include <string>
#include <string_view>
#include <tuple>

#include <daw/json/daw_json_link.h>

struct test_document {
    std::string_view in;
};

namespace daw::json {
    template<>
    struct json_data_contract<test_document> {
        static inline constexpr char const _in[] = "in";
        using type = json_member_list<json_string<_in, std::string_view>>;

        static constexpr auto to_json_data( test_document const &doc ) {
            return std::forward_as_tuple( doc.in );
        }
    };
} // namespace daw::json

int main( ) {
    std::cout << "Building New Request..." << std::endl;

    // 0xFE is never a valid UTF-8 byte; this is what triggers the hang
    const std::string_view test_string( "\x0D\x0ATEST\xFETEST\x0D\x0A" );
    test_document document{ .in = test_string };

    const std::string request_json_payload( daw::json::to_json( document ) );

    std::cout << "Request JSON: " << request_json_payload << std::endl;
    return 0;
}

I expected request_json_payload to be {"in":"\r\nTEST\u00FETEST\r\n"}, but instead of getting that result, to_json "hangs", eating memory quickly, until the program is killed by the kernel as I said before.

In this case, the problem is specifically the 0xFE byte. In fact, if we change test_string to simply "\xFE", the same thing happens in the serializer.

I've tried changing the member to json_string_raw, but that's not an option since the other side needs the bytes to be encoded. I tried to investigate and saw there are some (hidden?) options like EightBitMode and Encode. I don't know if, by playing with these, I can bypass this issue.

For now, I've made a quick-and-dirty workaround: a simple !isprint() -> \uXXXX pre-encoding pass, with the result passed to daw as a raw string. But I'm interested in knowing what the problem is and whether it could be a bug; I hope this report is useful to you.

Thanks in advance!

beached commented 1 year ago

Thanks, I'll look into this. It should either fail with an error or complete successfully, not consume resources like that. https://jsonlink.godbolt.org/z/vfPrbhz9r

beached commented 1 year ago

I have found the issue in the Unicode iterator I use: when it hits these invalid code points it advances 0 bytes into the raw stream, and that is what causes the resource exhaustion.

I would like the option of encoding it as a \u.... escape, but also calling the error handler, as config options, since normally this is unwanted.

juceda commented 1 year ago

I figured things were going that way. But I'm not clear whether encoding bytes like \u00FE as-is could raise another problem later.

E.g. "\xFE0123" (the 0xFE byte, then ASCII '0', '1', '2', '3'), encoded as \u00FE0123: someone could read that back when parsing as the 32-bit value 00FE0123, since {} delimiters are not available in all languages. Crazy thing.

I'm refreshing my knowledge of Unicode character sequences and representations...

FYI, I just tested json_encode in PHP and it failed to encode "\x0D\x0ATEST\xFETEST\x0D\x0A", returning the following error: Malformed UTF-8 characters, possibly incorrectly encoded.

So if some bytes are not expected, or might produce undefined behavior, it seems fine to raise an error.

For my particular case, I have it patched as I mentioned for now, but I may have to tell clients that if they keep sending non-printable bytes, we will have to use a byte/number array instead of a string to avoid problems (or a Base64 string).

Thank you Darrell.

beached commented 1 year ago

I am going to support it; there are options like this already for compatibility. The default will definitely be an error, though.

beached commented 1 year ago

Got a fix for this in the works: https://github.com/beached/daw_json_link/pull/391. The default is going to be an error, but there is also an option to escape when such bytes are encountered, which gives the expected output.

beached commented 1 year ago

https://github.com/beached/daw_json_link/releases/tag/v3.19.0 has the fixes. Again, thanks for reporting.