Feature request: non-recursive `GenericValue` destructor

itrofimow commented 1 year ago

Hi!

There is a kParseIterativeFlag, which prevents stack exhaustion when parsing deeply nested json data, however destruction of the parsed value is still recursive and might blow the stack up in case of extremely deeply nested objects or small stacks.

We at https://github.com/userver-framework/userver use stackfull coroutines with the stacks way less than 8Mb, and we have a workaround for destroying GenericDocument iteratively, which is a bit ugly, but at least it works.

Here comes the problem:

Given following code

GenericDocument<...> json;
json.Parse<kParseIterativeFlag>(data, size);

and the json data in form

{
    "o": <deeply nested>,
    "non-parseable nonsense"
}

the "o" object would be destroyed in the Parse function itself, before we are able to wrap it into non-recursive destruction routine, and with depth big enough it would blow the stack up, which effectively defies the purpose of kParseIterativeFlag at all.

One workaround for that could be a depth limit for Parse, however it would still leave some room for constructing jsons nested deeper than the aforementioned limit, so what i propose is a compile-time flag to destroy GenericDocument iteratively, to avoid potentially dangerous recursion once and for all.

aikawayataro commented 1 year ago

GenericDocument's destructor is not recursive it delete's the allocator and that's it. The default allocator MemoryPoolAllocator's destructor is not recursive too. But GenericDocument inherits GenericValue whose destructor is recursive. So I think your concern is GenericValue's destructor but it's recursive only if Allocator::kNeedFree == true which is not true with the default MemoryPoolAllocator. Any further info or tests?

itrofimow commented 1 year ago

Hi @aikawayataro!

Yes, sorry, you're right of course, this is all about GenericValue destructor, i'll rename the issue accordingly,

Indeed, this is not an issue with the MemoryPoolAllocator, however i have a decently large codebase with CrtAllocator used all over the place (to allow mixing values created by different instances of the CrtAllocator, to allow Document -> Value conversion https://github.com/Tencent/rapidjson/issues/387, etc. etc.), and i don't think it would be feasible for me to change that.

This is easily reproducible with the following code:

#include <string_view>

#include <rapidjson/document.h>

namespace impl {
::rapidjson::CrtAllocator g_allocator;

using UTF8 = ::rapidjson::UTF8<char>;
using Document = ::rapidjson::GenericDocument<UTF8, ::rapidjson::CrtAllocator,
                                              ::rapidjson::CrtAllocator>;
} // namespace impl

void Parse(std::string_view data) {
  impl::Document json{&impl::g_allocator};
  rapidjson::ParseResult ok =
      json.Parse<rapidjson::kParseDefaultFlags |
                 rapidjson::kParseIterativeFlag |
                 rapidjson::kParseFullPrecisionFlag>(data.data(), data.size());

  if (!ok) {
    exit(1);
  }
}

Now if i feed into Parse data in the following format

{
    "o": {
        "o": {
            "o": {
                <nesting goes on>
            }
        }
    },
    "nonsense"
}

, which could be generated by the following code, for example:

#include <iostream>
#include <string>

std::string GenerateNestedJson(std::size_t depth) {
    std::string result{};
    result.reserve(depth * 8);

    for (std::size_t i = 0; i < depth; ++i) {
        result.append("{\"o\":");
    }
    result.append("{}");
    for (std::size_t i = 0; i < depth; ++i) {
        result.push_back('}');
        if (i + 2 == depth) {
            result.append(",\"nonsense\"");
        }
    }

    return result;
}

int main() {
    std::size_t depth; std::cin >> depth;
    std::cout << GenerateNestedJson(depth) << std::endl;

    return 0;
}

, the Parse function is guaranteed to segfault with the depth big enough: it takes depth of around 135000 (with the data being around 800Kb) with the default stack size of 8Mb, and with the stack size of 256Kb (set by ulimit -s 256, for example) it takes depth of ~4000 with the data being just 25Kb.

aikawayataro commented 12 months ago

Okay, I understand your point.

The problem is that in order to implement a new flag for a value, the internal API have to be changed, including the GenericValue constructor, which would add complexity for GenericDocument. Unfortunately RapidJSON doesn't have a template API to use user-provided Value class which makes it difficult to add this kind of functionality in usercode. If you have an idea how to implement this feature clean please share your thoughts.

I see the solution to your problem is to use GenericReader to read your JSON, build the Document manually using Handler, and in case of an error perform your cleanup procedure.

void IterativeClean(Value &value);

int main() {
    std::string_view json = "{}";

    struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler> {
        MyHandler(Document &doc) : doc_(doc) {
            stack.push(&doc_);
        }

        bool Default() {
            return true;
        }

        // Implement callbacks from BaseReaderHandler here.

        std::stack<Value *> stack;
        Document &doc_;
    };

    MemoryStream json_stream = MemoryStream(json.data(), json.size());
    Reader reader;

    Document doc;
    MyHandler handler(doc);

    auto res = reader.Parse<kParseIterativeFlag>(json_stream, handler);
    if (res.IsError()) { // Parsing failed
        IterativeClean(doc);
    }

    // ...

    IterativeClean(doc);

    return 0;
}

aikawayataro commented 12 months ago

A simpler and faster code can also be achieved by using GenericDocument::Populate, but in case of parsing error you have to "close" all incomplete nodes, such as key-values, objects, and arrays.

itrofimow commented 12 months ago

Yes, the problem with GenericDocument<>::Parse() could be solved via SAX parsers, but this was merely an example, and the much broader problem is that a GenericValue could be destroyed recursively somewhere, anywhere.

At userver we try really hard to wrap any native rapidjson::GenericValue into some RAII type, which destroys the value iteratively, like this, but this gets very complicated very quickly:

any time rapidjson::GenericValue is passed around extreme cautious is required to not accidentally let it be destroyed by its native destructor
any time a function from rapidjson:: is called we have to make sure it doesn't destroy the value (or its children) in any execution path. This invariant is basically impossible to maintain
even if we could ensure that none of the above happens with the current rapidjson version, it would be very hard for us to bump the version: we would have to validate virtually all the functionality we have atop of rapidjson, and we have a lot of it

We consider this a security threat (carefully crafted data sent by an attacker could potentially lead to a DoS) for us, and are willing to provide a patch (we vendor rapidjson and we'll likely patch it anyway). The easiest and the least intrusive way i can think of is something in the lines of

~GenericValue() {
#ifdef RAPIDJSON_ITERATIVE_VALUE_DESTRUCTOR
    DestroyIteratively(...); // this has to be implemented
#else
    ... current implementation
#endif
}

We understand that iterative destruction would be measurably slower, but we are ok with that.

What do you think?

aikawayataro commented 12 months ago

I think you should submit a pull request then. To avoid misunderstandings, I am not the maintainer of this repository. You need to define your destroyer in the internal namespace, add the ability to enable it via option() in CMake, and introduce a test if necessary.

Tencent / rapidjson

Feature request: non-recursive `GenericValue` destructor #2217