Tencent / rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
http://rapidjson.org/
Other
14.17k stars 3.52k forks source link

Request for clarification about the "copy" argument in Handler methods #1470

Open lelit opened 5 years ago

lelit commented 5 years ago

Hi all, I'm struggling trying to find the reason of a memory leak issue in the Python binding I'm maintaining.

The leak happens only in the streaming interface: I wrote a wrapper around a Python stream, that is used when the user passes a stream instead of a string object. Using the string interface I could not reproduce the issue in any way, even parsing thousands of times a moderately big JSON string...

After hours of analysis, I'm pretty confident that the problem is not related to the wrapper, but most probably is due to something else, and in particular to the different code paths that are selected when using a stream instead of a string: in the latter case the parsing is done using kParseInsituFlag, while the former uses the default behaviour. So I focused on trying to understand the logic behind that default strategy, and that consolidated a doubt about the fact that the reading Handler basically ignores the copy argument passed to some of its methods (see RawNumber and String).

The documentation about the handler did not help me: it says that "The last copy indicates whether the handler needs to make a copy of the string. For normal parsing, copy = true. Only when insitu parsing is used, copy = false." but it does not go into detail about what that actually means. I tried to follow RJ code, in particular to understand whether I should dispose the value when copy = true, but as my C++-fu is rather weak I could not figure that out. I blindly tried to free() that value, but that leaded to a memory error.

As I'm surely missing something, any help in better understanding the mentioned logic would be greatly appreciated!

miloyip commented 5 years ago

When using insitu parsing, the source JSON will be modified, such as adding null character to the end of the string in the source JSON. As such, you can make a DOM which directly point to that address, instead of (normally) copy the string from which the parser give you.

For insitu parsing, you need to keep that "source JSON" (although it has been modified and not a valid JSON anymore) until no pointers are pointed to the parsed strings in it. If you are using insitu parsing, you need to handle the life cycle of that "source JSON".

I didn't dig into the exact "memory leak" as you mentioned.

This may help: http://rapidjson.org/md_doc_dom.html#InSituParsing

lelit commented 5 years ago

Thank you.

So, am I right assuming that the handler should never free/dispose the str values it get, regardless the value of copy?

The Python binding module used insitu parsing since the beginning; I exposed the "stream" variant of the API some months ago, and that must use the normal way of parsing. It is when the module operates in this way that something isn't properly released and the memory consumption grows linearly.