Tencent / rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
http://rapidjson.org/
Other
14.24k stars 3.54k forks source link

Weird memory allocation behavior #1471

Open ashwinkapur opened 5 years ago

ashwinkapur commented 5 years ago

Hi,

This is my first time using RapidJSON so I may be doing something wrong. I was having what seemed like a memory leak using Document and I've reduced it to what seems like a really simple case.

I have a class called stock_streamer that is receiving messages from a library as const references to string and I'm trying to parse them. Since these are const references to string I can't use in situ parsing unless I copy the string over to my own char buffer but I didn't do that because I assumed that it would not necessarily be more efficient than letting the library do it.

The code. In the header for the class stock_streamer I have (I've tried many different sizes for val_buf_sz and parse_buf_sz):

    // what seems like relevant bits only. I can send the whole file if it would help.
    using DocumentType = rapidjson::GenericDocument<
        rapidjson::UTF8<>,
        rapidjson::MemoryPoolAllocator<>,
        rapidjson::MemoryPoolAllocator<>>;

    static const size_t val_buf_sz{32768};
    char val_buf[val_buf_sz];
    rapidjson::MemoryPoolAllocator<> val_alloc{&val_buf[0], sizeof(val_buf)};

    static const size_t parse_buf_sz{32768};
    char p_buf[parse_buf_sz];
    rapidjson::MemoryPoolAllocator<> parse_alloc{&p_buf[0], sizeof(p_buf)};

    DocumentType json_parser{&val_alloc, sizeof(p_buf), &parse_alloc};

I'm repeatedly calling the function:

void stock_streamer::on_message(const std::string& msg, size_t wire_size) {

    rapidjson::ParseResult ok = this->json_parser.Parse(msg);
    if(!ok) {     // NOLINT(readability-implicit-bool-conversion)
        const std::string err_msg = fmt::format(
            "Parsing Error Message {} Code: {} @ Offset: {}",
            msg, ok.Code(), ok.Offset());
        throw rbt_err::bad_msg(err_msg);
    }
    if(!this->json_parser.IsArray()) {
        const std::string err_msg = fmt::format(
            "Message is not a JSON array: {}", msg);
        throw rbt_err::bad_msg(err_msg);
    }

   // some commented out code here.

    const size_t val_buf_used = val_alloc.Size();
    if(val_buf_used > val_buf_sz) {
        auto logger = spdlog::get("logger");
        logger->warn("Allocated {} bytes from value allocator.", val_buf_used);
    }
    const size_t p_buf_used = parse_alloc.Size();
    if(p_buf_used > parse_buf_sz) {
        auto logger = spdlog::get("logger");
        logger->warn("Allocated {} bytes from parse allocator.", p_buf_used);
    }

    this->json_parser.SetNull();
    this->val_alloc.Clear();
    this->parse_alloc.Clear();
}

I've noticed that if I don't call this->parse_alloc.Clear() the number of bytes allocated by parse_alloc grows without bound with more messages. I deduce this because I get output saying:

Allocated bytes from parse allocator (where N is a monotonically increasing number).

However, I've also noticed that no matter what I set parse_buf_sz to, even when I call parse_alloc.Clear(); I always get output saying:

Allocated <parse_buf_sz + 256> bytes from parse allocator.

The messages (each a JSON array of small objects, frequently the array contains only one object) I'm processing are not large. The largest message I've seen is under 200 characters.

I get the same behavior using the default gcc compiler on Ubuntu 18.04 (g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0) on MacOS Sierra using the Apple system default clang (Apple LLVM version 10.0.0 (clang-1000.11.45.5)) and Macports clang (clang version 7.0.1 (tags/RELEASE_701/final)).

In my CMake file I have the following definitions: target_compile_definitions(test PRIVATE RAPIDJSON_48BITPOINTER_OPTIMIZATION=1 RAPIDJSON_HAS_STDSTRING=1 RAPIDJSON_SSE2=1 RAPIDJSON_SS42=1)

Is there any other information I could provide that would be helpful?

Am I doing something wrong? Or is this a bug and should I be doing something differently. Should I be doing something differently, i.e. is there a better way to use RapidJson here?

Thanks a ton.

Ashwin Kapur

miloyip commented 5 years ago

I could reproduce this unexpected behavior.

miloyip commented 5 years ago

I found that there were two issues:

  1. The MemoryPoolAllocator has a memory overhead, so its capacity is less than the size of buffer. DocumentType(&val_alloc, parse_alloc.Capacity(), &parse_alloc} should be the correct way. The user guide needed to be corrected.
  2. Both GenericDocument and GenericReader contains a stack for parsing, and they allocate from the same allocator. The default capacity of GenericDocument's stack is the number provided by user. But the one in GenericReader is GenericReader::kDefaultStackCapacity = 256. So the allocator must allocate more memory (256 bytes in addition to the user buffer). This is a design flaw, and needed to be fixed.

I considered these two are bugs but it only affects performance and do not affect the correctness of parsing.

gelldur commented 4 years ago

@miloyip please update documentation as example there also cause memory allocation https://rapidjson.org/md_doc_dom.html#UserBuffer