Tencent / rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
http://rapidjson.org/

kParseNumbersAsStringsFlag fails with UTF16 input #1412

Open matttyson opened 5 years ago

matttyson commented 5 years ago

When parsing a UTF16 document with kParseNumbersAsStringsFlag, only the first character of the number is passed to the RawNumber() handler. The remaining characters are null characters.

This bug does not manifest if the document is UTF8, or if the document is UTF16 and the kParseInsituFlag is used.

I've provided a reproducer below, but I haven't found the cause of the bug as yet.

Compile with -DWIDE to enable wide characters and -DINSITU to enable in-situ parsing. The expected output is "500" in all cases.

g++ -DWIDE -o main main.cpp
output: 5
g++ -DWIDE -DINSITU -o main main.cpp
output: 500
g++ -o main main.cpp
output: 500

Below is the reproducer code

#include <rapidjson/rapidjson.h>
#include <rapidjson/reader.h>
#include <rapidjson/stream.h>

#include <cstdint>
#include <cstdio>
#include <cwchar>

#ifdef WIDE
typedef rapidjson::UTF16<> utftype;
#define STR(x) L ## x
#define PUTOUT(x) putwchar((x))
#else
typedef rapidjson::UTF8<> utftype;
#define STR(x) (x)
#define PUTOUT(x) putchar((x))
#endif

#ifdef INSITU
constexpr rapidjson::ParseFlag pflags =
    static_cast<rapidjson::ParseFlag>(rapidjson::kParseNumbersAsStringsFlag | rapidjson::kParseInsituFlag);
#else
constexpr rapidjson::ParseFlag pflags = rapidjson::kParseNumbersAsStringsFlag;
#endif

class parser: public rapidjson::BaseReaderHandler<utftype, parser> {
public:
    parser(){}
    bool Null(){return true;}
    bool Bool(bool b){return true;}
    bool Int(int value){return true;}
    bool Uint(unsigned int value){return true;}
    bool Int64(int64_t value){return true;}
    bool Uint64(uint64_t value){return true;}
    bool Double(double value){return true;}
    bool String(const Ch* str, rapidjson::SizeType length, bool copy){return true;}
    bool Key(const Ch* str, rapidjson::SizeType length, bool copy){return true;}
    bool StartObject(){return true;}
    bool EndObject(rapidjson::SizeType memberCount){return true;}
    bool StartArray(){return true;}
    bool EndArray(rapidjson::SizeType elementCount){return true;}
    bool RawNumber(const Ch * str, rapidjson::SizeType length, bool copy)
    {
        for(rapidjson::SizeType i = 0; i < length; i++){
            PUTOUT(str[i]);
        }
        PUTOUT(STR('\n'));
        return true;
    }
};

int main(int argc, char *argv[])
{
    utftype::Ch foo[] = STR("{\"Number\": 500}");
    parser p;

    rapidjson::GenericReader<utftype, utftype> r;
#ifdef INSITU
    rapidjson::GenericInsituStringStream<utftype> ss(foo);
#else
    rapidjson::GenericStringStream<utftype> ss(foo);
#endif

    rapidjson::ParseResult pr = r.Parse<pflags>(ss, p);

    return 0;
}

voropz commented 4 years ago

I ran into the same issue. In reader.h, the source and destination streams use the same memory (stack_), so the transcoded UTF-16 characters overwrite the UTF-8 source while it is still being read. Since every digit has 0 in its second byte in UTF-16, the result is a correct first character followed by zeros in place of the others.

SizeType numCharsToCopy = static_cast<SizeType>(s.Length());
StringStream srcStream(s.Pop());
StackStream<typename TargetEncoding::Ch> dstStream(stack_);
while (numCharsToCopy--) {
    Transcoder<UTF8<>, TargetEncoding>::Transcode(srcStream, dstStream);
}

@miloyip
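
The overwrite described above can be reproduced without rapidjson. The following is a minimal sketch, not the library's code: widenAscii, unitsFrom, transcodeAliased and transcodeSeparate are hypothetical names, a plain byte vector stands in for stack_, and 16-bit units stand in for TargetEncoding::Ch. It assumes a little-endian machine and ASCII-only input such as digits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Widen n ASCII bytes read from src into 16-bit units written to dst.
// src and dst are allowed to alias; that is what the sketch demonstrates.
static void widenAscii(const unsigned char* src, unsigned char* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        char16_t unit = static_cast<char16_t>(src[i]);            // read one source byte
        std::memcpy(dst + i * sizeof unit, &unit, sizeof unit);  // write one 16-bit unit
    }
}

// Reinterpret the first n 16-bit units stored in buf as a UTF-16 string.
static std::u16string unitsFrom(const std::vector<unsigned char>& buf, std::size_t n) {
    assert(buf.size() >= n * sizeof(char16_t));
    std::u16string out(n, u'\0');
    std::memcpy(&out[0], buf.data(), n * sizeof(char16_t));
    return out;
}

// Source and destination share one buffer (stand-in for stack_).
// Each 16-bit write lands on source bytes that have not been read yet.
std::u16string transcodeAliased(const std::string& digits) {
    std::vector<unsigned char> buf(digits.size() * sizeof(char16_t), 0);
    std::memcpy(buf.data(), digits.data(), digits.size());
    widenAscii(buf.data(), buf.data(), digits.size());
    return unitsFrom(buf, digits.size());
}

// Same loop, but the source is copied out first so the writes cannot touch it.
std::u16string transcodeSeparate(const std::string& digits) {
    std::vector<unsigned char> src(digits.begin(), digits.end());
    std::vector<unsigned char> dst(digits.size() * sizeof(char16_t), 0);
    widenAscii(src.data(), dst.data(), digits.size());
    return unitsFrom(dst, digits.size());
}
```

On a little-endian machine, transcodeAliased("500") yields '5' followed by two zero units, which matches the output reported in this issue, while transcodeSeparate("500") yields all three digits. Whether the actual fix copies the source or reorders the writes is not something this sketch claims to show.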

ibulgakov commented 4 years ago

Hi! Faced this bug in our project. @miloyip, do you need any additional information? Is it possible to fix it?

nglass commented 4 months ago

This issue is a duplicate of https://github.com/Tencent/rapidjson/issues/1923 and was fixed by https://github.com/Tencent/rapidjson/pull/1926