TkTech / pysimdjson

Python bindings for the simdjson project.
https://pysimdjson.tkte.ch

Improve user experience of memory safety. #82

Closed TkTech closed 1 year ago

TkTech commented 3 years ago

We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing a new document while references to the old one still exist. This is correct, in that it prevents use-after-free errors. I wasn't terribly happy with this, but it's better than segfaulting.
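For illustration, here is roughly what that check means for user code (a sketch; the exact exception type is an assumption on my part, not quoted from the library):

```python
import simdjson

parser = simdjson.Parser()
doc = parser.parse(b'{"res": [{"name": "first"}]}')

# `doc` is a proxy backed by the parser's internal buffer. With the v4
# check, re-parsing while it is still alive raises instead of silently
# invalidating it (which previously could segfault on access).
try:
    parser.parse(b'{"other": 1}')
except RuntimeError:
    print("old document still referenced")

del doc
parser.parse(b'{"other": 1}')  # fine once the old proxies are gone
```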

It has downsides, though. Brainstorming welcome. Alternatives:

* Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as an Object or Array references it.

chutz commented 3 years ago
> Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as an Object or Array references it.

Rather than making it raise an exception, it might be a nicer user experience to have the Python wrapper for the Parser object emit a RuntimeWarning and then create a new parser when the current one still has references. This would avoid breaking existing code while giving developers a way to find and fix these bugs. It also avoids any extra performance overhead unless it's needed for the code to work as expected.
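A minimal sketch of that suggestion (the wrapper class is hypothetical and not part of pysimdjson, and it assumes the v4 check raises RuntimeError):

```python
import warnings

import simdjson


class ForgivingParser:
    """Hypothetical wrapper: warn and recover instead of raising."""

    def __init__(self):
        self._parser = simdjson.Parser()

    def parse(self, data):
        try:
            return self._parser.parse(data)
        except RuntimeError:
            # The previous document is still referenced somewhere. Warn
            # so the developer can find the bug, then fall back to a
            # fresh parser so existing code keeps working.
            warnings.warn(
                "Parser re-used while its document was still referenced; "
                "allocating a new parser.",
                RuntimeWarning,
                stacklevel=2,
            )
            self._parser = simdjson.Parser()
            return self._parser.parse(data)
```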

kaber2 commented 2 years ago
> Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as an Object or Array references it.

Getting back to this: I have some long-lived objects created from JSON that are lazily constructed when needed (most of them never are). What I did so far with orjson was simply parse JSON messages as they were received, create those objects, and pass them on to the caching and lazy-construction logic. I now wanted to switch to pysimdjson, but the one-document-per-parser limitation means my use case won't work.

My naive assumption would be that the Object and Array proxies hold a reference to the document, and the document is freed once it is no longer referenced.

So would something like this work:
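(A pure-Python sketch of the idea only; `_Document` and the refcount check here are stand-ins for the real csimdjson internals, not actual pysimdjson code.)

```python
import json
import sys


class _Document:
    """Stand-in for the reusable parsed-document buffer."""

    def __init__(self):
        self.root = None

    def fill(self, data):
        self.root = json.loads(data)  # placeholder for the real parse


class Parser:
    def __init__(self):
        self._document = _Document()

    def parse(self, data):
        # Exactly two references exist here when no proxies are alive:
        # self._document itself and getrefcount()'s own argument. Anything
        # more means an Object/Array proxy still holds the document, so
        # leave it to them and allocate a fresh one instead of raising.
        if sys.getrefcount(self._document) > 2:
            self._document = _Document()
        self._document.fill(data)
        return self._document  # proxies keep the document alive
```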

That would make all memory handling transparent to the user and "simply work". If used in the currently supported way, there's no performance impact; if multiple parsed documents are required, you pay the price, whatever that may be.

Alternatively, could an API be exposed that allows parsing into new documents manually?

Thanks!

zrothberg commented 1 year ago

I was poking around this library for use in a project and stumbled upon this issue. I think the approach below may work out for you.

@TkTech something that may save you some headaches with not only this problem but also your thread-safety problem is to borrow an implementation detail from HTTP libraries. Instead of creating the object directly, you can return a parser via a `with` block. That lets you isolate parser creation and destruction from entering and exiting the context. You can use a simple queue to hold all the parsers instead of just letting them get GC'd; if there are no parsers in the queue, you create a new one.

That may also help you with the PyPy issue, since you can't return the parser to the queue until its references are gone. Though TBH you probably want to just throw an error if an Object or Array hasn't been cleaned up when the parser is closed.

That would just change your access API to something along the lines of the following:

```python
with simdjson.getparser() as parser:
    parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')
```

In getparser you then just check whether a parser is available in the queue, creating one if there isn't. When you close the parser, you simply return it to the queue to be reused. LMK if I wasn't clear anywhere; I can write some pseudocode that better shows what I'm talking about.
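A minimal sketch of that pooling idea (the module-level pool and `getparser()` are the proposed additions from the snippet above, not existing pysimdjson API):

```python
import queue
from contextlib import contextmanager

import simdjson

# Idle parsers waiting to be reused; grows only when demand requires it.
_pool = queue.SimpleQueue()


@contextmanager
def getparser():
    try:
        parser = _pool.get_nowait()   # reuse an idle parser if one exists
    except queue.Empty:
        parser = simdjson.Parser()    # otherwise create a fresh one
    try:
        yield parser
    finally:
        # Only safe once no Object/Array proxies from this parser remain;
        # a real implementation would verify that here and raise if not.
        _pool.put(parser)
```

Because the queue hands each parser to only one `with` block at a time, the same pattern could also give a simple story for the thread-safety concern mentioned above.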

TkTech commented 1 year ago

Avoided entirely by using parse_into_document and removing long-lived object proxies in #110.