Closed TkTech closed 1 year ago
* Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.
Rather than making it raise an exception, it might be a nicer user experience to have the Python wrapper for the Parser
object emit a RuntimeWarning
, then create a new parser when the current parser still has references. This will avoid breaking existing code, while providing a way for developers to find these bugs and fix them. It also avoids any extra performance overhead unless it's needed for the code to work as expected.
- Use the new
parse_into_document()
and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.
Getting back to this - I have some long lived objects created from JSON that are lazily constructed when needed (most of them are not). So what I used to do so far using orjson was simply parse JSON messages as they are received, create those objects and pass them on to the caching and lazy construction logic. I now wanted to switch to pysimdjson, but the one document per parser limitation means my use case won't work.
My naive assumption would be that the Object and Array proxies hold a reference to the document and it is freed when no longer referenced.
So would something like this work:
That would make all memory handling transparent to the user and "simply work". If used in the currently supported way, there's no performance impact, if multiple parsed documents are required, you pay the price, whatever that may be.
Alternatively, could an API be exposed that allows to parse into new documents manually?
Thanks!
I was poking around this library to use for a project and stumbled upon this issue. I think the below may work out for you.
@TkTech it that may save you some headaches dealing with not only this problem but with your thread safety problem is to take an implementation detail out of http libraries. You can instead of creating your object directly just return a parser via with. That would let you isolate the creation and destruction from the entrance and exiting of it. You can use a simple queue to hold all the parsers instead of just letting them get gc'ed. If there are no parsers in the queue you can create a new one.
That may also help you with the PyPy issue as you can not return the object to the queue until after its references are gone. Though TBH you probably want to just throw an error if an Object or Array hasn't been cleaned up when you close the parser.
That would just change your access api to something along the following
with simdjson.getparser() as parser:
parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')
You can then just in getparser check if there is one available in a queue or create one if there is not. When you close the parser you simply return it to the queue to be reused. LMK if I wasn't being clear anywhere I can make u some pseudocode that will better show what I am talking about.
Avoided entirely by using parse_into_document
and removing long-lived object proxies in #110.
We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references continue to exist to the old one. This is correct, in that it ensures no errors. I wasn't terribly happy with this, but it's better then segfaulting.
It has downsides:
del
the old objects, even if you didn't intend to use them again. Very un-pythonic.del
is unreliable. The objects may not be garbage collected until much later.Brainstorming welcome. Alternatives:
parse()
is called with a different document. The performance of this would probably be unacceptable - I might be wrong.parse_into_document()
and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.