doerwalter opened this issue 1 month ago
@doerwalter I won't be able to respond in detail now, but in the meantime I'll point you to #44. I'll try to answer more carefully tomorrow.
Thanks @doerwalter for the wait. I haven't had as much time on my hands as I'd like, but I still wrote the message above to signal that I didn't intend to forget about this issue. Today I finally could sit down and write a proper reply. Sorry for the wall of text, but this is a good opportunity to do some brain dumps and somewhat document things I guess.
Firstly, thanks again for the very detailed message, and presenting the problem very clearly.
Long story short, the main issue with supporting generators or iterables as inputs is that it cannot be done in the "generic" routines (see below), as this would break the support ijson currently has for event interception. This feature works on the premise that users can provide a generator to ijson as an input, with the generator yielding 2- or 3-tuples (rather than strings or bytes).
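For reference, the interception pattern looks roughly like this (the file name, prefix and the pass-through filter are made up for the example):

import ijson

def log_events(events):
    # pass every (prefix, event, value) 3-tuple through unchanged, just printing it
    for prefix, event, value in events:
        print('seen:', prefix, event, value)
        yield prefix, event, value

with open('data.json', 'rb') as f:
    # items() accepts the events produced by parse() instead of a file,
    # which is what makes intercepting them in a user generator possible
    for item in ijson.items(log_events(ijson.parse(f)), 'results.item'):
        print(item)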
... So to be able to parse that JSON with ijson I have to wrap this generator in a file like object. I'm doing it like this: [code] ... The read method has to ensure that the strings it returns have the correct size...
Just one comment here: ijson never reads with size==-1 (that's the whole point! :wink:), so if you're using this class with ijson only then you can relax your code. The size argument is also usually a maximum: it doesn't have to be the exact number, and users/libraries (and ijson certainly does this) should always check the length of the returned data. It wouldn't change your code much anyway, I think.
... nothing gets output... ... My guess would be that that is caused by the default buffer size used by ijson being 64K... ... However I don't care about the input being parsed in 10-character chunks ... the back and forth between Python and C drowns out the performance advantage of the C parser... ... For my example the result was that using the pure Python parser is as fast as using yajl2_c
You are correct that in your case the buffer size is what's preventing the parsing process from proceeding, and indeed using a smaller buffer size for your workaround is the way to go, even if that creates more back and forth to/from the underlying parser. In any case, yours is a very small JSON document, so it makes sense that the backend you choose makes no humanly noticeable difference (there sure is a difference, but it will be minimal).
About the back and forth: that's exactly what a strings/bytes iterator input would end up doing too: each time you (the user) yield some bytes/str, we'd pass that through the parser. The main difference is that users don't need to pass a buf_size anymore, as each yielded str/bytes can be of a different size, and will be passed to the parser as-is.
I tried to add support for parsing from generators to ijson, but I'm really out of my depths here. What I tried to do was: [code]
That'd be kind of a good first approach, but like I say below there's no way of doing this in the generic entry points. What we need to do is add new specialised entry points.
I also don't know what the difference between basic_parse and basic_parse_gen is
basic_parse_gen only knows how to deal with file-like objects, basic_parse_async only with async file-like objects, and so on. OTOH, basic_parse detects the input type and dispatches to the correct specialised routine, which is why you can do both async for _ in ijson.basic_parse(...) and for _ in ijson.basic_parse(...) from the same routine, depending on the input. The other routines (items, kvitems and parse) follow the same logic.
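As a small illustration of that dispatching (assuming a local data.json, and using aiofiles purely as an example of an async file-like object):

import asyncio
import aiofiles
import ijson

# sync file-like input: basic_parse yields (event, value) pairs via plain iteration
with open('data.json', 'rb') as f:
    for event, value in ijson.basic_parse(f):
        print(event, value)

# async file-like input: the same basic_parse is iterated with "async for"
async def main():
    async with aiofiles.open('data.json', 'rb') as f:
        async for event, value in ijson.basic_parse(f):
            print(event, value)

asyncio.run(main())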
There's no way (that I can see) to add support for str/bytes iterables as inputs to basic_parse (the generic routine) without breaking support for event interception though, but OTOH event interception is a somewhat niche feature. I'm not sure whether string iterables as inputs would be more or less popular than event interception, so I'm not initially inclined to break one to support the other.
What can be done however, and I hadn't thought about this during #44 I think, is to add a specialised method that deals only with bytes/string iterables as inputs (e.g., basic_parse_iterable or similar). This would have to be done for all methods. The implementation should be (relatively) trivial. The yajl2_c backend is a bit special and might require a bit more care, but nothing insurmountable.
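Purely as a sketch of what I have in mind (basic_parse_iterable does not exist today, so this is hypothetical usage only):

import ijson

def chunks():
    # whatever produces the JSON document piecewise, preferably as bytes
    yield b'{"results": ['
    yield b'{"a": 1}, '
    yield b'{"a": 2}]}'

# hypothetical API: each yielded chunk would be fed to the parser as-is
for event, value in ijson.basic_parse_iterable(chunks()):
    print(event, value)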
the parser seems to expect bytes ... [adds UTF8 encoding] ... but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser
I wouldn't care about the Python parser (it's miles behind the others in terms of performance). And also, in general, encoding/decoding usually isn't the bottleneck, although it can make a measurable difference (and still, on principle, you want to be as efficient as possible). But for more reference...
The yajl parsers (and an experimental Boost.JSON backend I baked some time ago) all expect bytes (UTF-8-encoded bytes, because that's what JSON should be encoded in), and therefore the library works at its best when users provide bytes, which is also usually what you get from network operations like HTTP traffic, or from reading files -- so it should be a win-win. When users provide file-like sources that return strings on read, we emit a warning and internally UTF-8-encode them before passing them to the backend. I'd do something similar for string/bytes iterators: I'd expect them to give bytes preferably, otherwise we'll end up internally encoding them and warning about it.
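If a source can only hand out str chunks, a thin adapter on the user side would keep the bytes-preferred contract; a minimal sketch (names made up):

async def as_utf8_bytes(str_chunks):
    # re-encode each str chunk so the backend receives the bytes it prefers
    async for chunk in str_chunks:
        yield chunk.encode('utf-8')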
In the case of the library you're using, does it have any option to return the non-decoded bytes instead? The fact that it is returning strings with fragments of a raw JSON document means it is doing some unnecessary extra work to begin with.
Yes, the Python backend internally uses str for parsing. I settled on that solution because, at least at the time, it was (unintuitively) faster: backends always return strings, not bytes, for string objects in the JSON document, and the Python backend was faster when decoding the full JSON document and then extracting those string objects than when dealing with bytes as input and only decoding the string objects that needed to be given to the user. I have had the intention to undo this and let the Python backend deal with bytes internally to remove some complexity, even at the expense of degraded performance (again, nobody should expect the Python backend to be fast). The README even used to say (second bullet point in FAQ #5) that this internal implementation detail might change in the future.
@doerwalter just a final remark about the size argument for read: you can even be laxer, and just return await anext(self.iterator, b"") in your read method. Again, ijson will check the size of what you return and work with that.
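In other words, the whole wrapper can shrink to something like this (a minimal sketch; the class name is a placeholder, and anext() with a default needs Python 3.10+):

class AsyncIteratorReader:
    def __init__(self, iterator):
        self.iterator = iterator  # async iterator yielding bytes chunks

    async def read(self, size: int = -1) -> bytes:
        # ignore `size` entirely: return the next chunk, or b"" (EOF) once exhausted;
        # ijson checks the length of whatever comes back
        return await anext(self.iterator, b"")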
Thanks @doerwalter for the wait. I haven't had as much time on my hands as I'd like, but I still wrote the message above to signal that I didn't intend to forget about this issue. Today I finally could sit down and write a proper reply. Sorry for the wall of text, but this is a good opportunity to do some brain dumps and somewhat document things I guess.
Thanks for the detailed reply, this is really appreciated.
Firstly, thanks again for the very detailed message, and presenting the problem very clearly.
For context, this is the feature request I made in langchain related to ijson: https://github.com/langchain-ai/langchain/discussions/27911
Long story short, the main issue with supporting generators or iterables as inputs is that it cannot be done in the "generic" routines (see below), as this would break the support ijson currently has for event interception. This feature works on the premise that users can provide a generator to ijson as an input, with the generator yielding 2- or 3-tuples (rather than strings or bytes).
OK, so passing a generator to basic_parse() etc. already means something else. In theory ijson could check to see what comes out of the generator and dispatch on that, but that would be a very messy API.
... So to be able to parse that JSON with ijson I have to wrap this generator in a file like object. I'm doing it like this: [code] ... The read method has to ensure that the strings it returns have the correct size...
Just one comment here: ijson never reads with size==-1 (that's the whole point! 😉), so if you're using this class with ijson only then you can relax your code. The size argument is also usually a maximum: it doesn't have to be the exact number, and users/libraries (and ijson certainly does this) should always check the length of the returned data. It wouldn't change your code much anyway, I think.
It feels wrong to me to ignore the size argument. And indeed when I replace:
async def read(self, size: int = -1) -> str:
    while size == -1 or len(self.buffer) < size:
        try:
            self.buffer += await anext(self.iterator)
        except StopAsyncIteration:
            break
    if size == -1:
        (result, self.buffer) = (self.buffer, "")
    else:
        (result, self.buffer) = (self.buffer[:size], self.buffer[size:])
    return result
with
async def read(self, size: int = -1) -> str:
    return await anext(self.iterator, b"")
my test script hangs.
... nothing gets output... ... My guess would be that that is caused by the default buffer size used by ijson being 64K... ... However I don't care about the input being parsed in 10-character chunks ... the back and forth between Python and C drowns out the performance advantage of the C parser... ... For my example the result was that using the pure Python parser is as fast as using yajl2_c
You are correct that in your case the buffer size is what's preventing the parsing process from proceeding, and indeed using a smaller buffer size for your workaround is the way to go, even if that creates more back and forth to/from the underlying parser. In any case, yours is a very small JSON document, so it makes sense that the backend you choose makes no humanly noticeable difference (there sure is a difference, but it will be minimal).
Using ijson in the (langchain) context is such a big step forward that we probably shouldn't have to worry about remaining potential performance improvements, but I was hoping that supporting input from generators would be easy to implement, which it doesn't seem to be.
About the back and forth: that's exactly what a strings/bytes iterator input would end up doing too: each time you (the user) yield some bytes/str, we'd pass that through the parser. The main difference is that users don't need to pass a buf_size anymore, as each yielded str/bytes can be of a different size, and will be passed to the parser as-is.
Exactly. The advantage would be that we would neither have to feed shorter input to the parser (because size was smaller than the chunk we've got), nor have to wait for further input (because size was larger than the chunk we've got).
I tried to add support for parsing from generators to ijson, but I'm really out of my depths here. What I tried to do was: [code]
That'd be kind of a good first approach, but like I say below there's no way of doing this in the generic entry points. What we need to do is add new specialised entry points.
OK.
I also don't know what the difference between basic_parse and basic_parse_gen is
basic_parse_gen only knows how to deal with file-like objects, basic_parse_async only with async file-like objects, and so on. OTOH, basic_parse detects the input type and dispatches to the correct specialised routine
Ah, OK, now the naming scheme makes sense! ;)
..., which is why you can do both async for _ in ijson.basic_parse(...) and for _ in ijson.basic_parse(...) from the same routine, depending on the input. The other routines (items, kvitems and parse) follow the same logic.
There's no way (that I can see) to add support for str/bytes iterables as inputs to basic_parse (the generic routine) without breaking support for event interception though, but OTOH event interception is a somewhat niche feature. I'm not sure whether string iterables as inputs would be more or less popular than event interception, so I'm not initially inclined to break one to support the other.
What can be done however, and I hadn't thought about this during #44 I think, is to add a specialised method that deals only with bytes/string iterables as inputs (e.g., basic_parse_iterable or similar). This would have to be done for all methods. The implementation should be (relatively) trivial. The yajl2_c backend is a bit special and might require a bit more care, but nothing insurmountable.
Sounds good to me.
the parser seems to expect bytes ... [adds UTF8 encoding] ... but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser
I wouldn't care about the Python parser (it's miles behind the others in terms of performance). And also, in general, encoding/decoding usually isn't the bottleneck, although it can make a measurable difference (and still, on principle, you want to be as efficient as possible). But for more reference...
OK, when we ignore the Python parser, there's really no feasible way around encoding and decoding, since the C parser expects bytes.
The yajl parsers (and an experimental Boost.JSON backend I baked some time ago) all expect bytes (UTF-8-encoded bytes, because that's what JSON should be encoded in), and therefore the library works at its best when users provide bytes, which is also usually what you get from network operations like HTTP traffic, or from reading files -- so it should be a win-win. When users provide file-like sources that return strings on read, we emit a warning and internally UTF-8-encode them before passing them to the backend. I'd do something similar for string/bytes iterators: I'd expect them to give bytes preferably, otherwise we'll end up internally encoding them and warning about it.
This sounds like a useful approach, except that we could drop the warning, and document the support for mixed str/bytes as a feature. ;)
In the case of the library you're using, does it have any option to return the non-decoded bytes instead? The fact that it is returning strings with fragments of a raw JSON document means it is doing some unnecessary extra work to begin with.
Well, the type annotations for the method I'm overwriting in langchain's JSON parser suggest that that's not possible:
def _transform(self, input: Iterator[Union[str, BaseMessage]]) -> Iterator[Any]:
    ...

class BaseMessage(Serializable):
    content: Union[str, list[Union[str, dict]]]
    """The string contents of the message."""
(see https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/output_parsers/transform.py#L111 and https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/messages/base.py#L23)
I'm using ijson to parse JSON that is coming out of a langchain (https://www.langchain.com/) pipeline. This pipeline basically is an async iterator over strings (i.e. str objects, not bytes) which, when joined together, give the complete JSON output. But as far as I can tell, ijson doesn't support parsing from such a generator directly (which is strange for a library that is based on generators and coroutines). So to be able to parse that JSON with ijson I have to wrap this generator in a file-like object. I'm doing it like this:
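(Sketch of the wrapper: the class name is a placeholder, and the read method is the one quoted verbatim further up in this thread.)

class AsyncIteratorTextReader:
    def __init__(self, iterator):
        self.iterator = iterator  # async iterator yielding str chunks
        self.buffer = ""

    async def read(self, size: int = -1) -> str:
        # accumulate chunks until `size` characters are buffered (or the input ends),
        # then hand back exactly the slice that was asked for
        while size == -1 or len(self.buffer) < size:
            try:
                self.buffer += await anext(self.iterator)
            except StopAsyncIteration:
                break
        if size == -1:
            (result, self.buffer) = (self.buffer, "")
        else:
            (result, self.buffer) = (self.buffer[:size], self.buffer[size:])
        return result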
For an example, the pipeline delivers 343 JSON fragments over the course of 34 seconds. I want to show output to the user as soon as the initial JSON snippets arrive (which happens after about one second). This seems to be exactly what ijson was made for.
However, when I try it out, nothing gets output until 34 seconds have passed, and then everything gets output at once. My guess would be that this is caused by the default buffer size used by ijson being 64K. And indeed the total size of the JSON output is less than 64K; furthermore, when I use a small buffer size when calling ijson.basic_parse_async() like this:
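(Illustrative sketch only; the function name and the buf_size value are placeholders, not my exact snippet.)

import ijson

async def show_events(pipeline_chunks):
    reader = AsyncIteratorTextReader(pipeline_chunks)  # the wrapper sketched above
    async for event, value in ijson.basic_parse_async(reader, buf_size=32):
        print(event, value)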
I get the desired effect: partial JSON gets output almost immediately.
However this seems to be inefficient for several reasons:
- It is an additional layer.
- The read method has to ensure that the strings it returns have the correct size, so it constantly has to slice and re-concatenate strings. However, I don't care about the input being parsed in 10-character chunks; I care about the incoming chunks being fed to the parser immediately, without being split into tiny chunks should a longer chunk arrive.
- I don't know how yajl is implemented, but I assume that when a small chunk size gets used, the C implementation of the parser can't play to its strengths, as the back and forth between Python and C drowns out the performance advantage of the C parser. (For my example the result was that using the pure Python parser is as fast as using yajl2_c.)
I tried to add support for parsing from generators to ijson, but I'm really out of my depth here. What I tried to do was:
I don't know whether this is the correct approach (the fact that the argument is called file_obj suggests that it isn't). I also don't know what the difference between basic_parse and basic_parse_gen is.
However, that nearly gets me there. I get an error, i.e. the parser seems to expect bytes. I can work around that problem by replacing the str return value with its UTF-8 encoding, but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser.
(For the C parser this is still needless inefficiency, but that would be harder to fix, since the C parser would have to directly operate on Python unicode data).
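(Concretely, the workaround amounts to something like this in the wrapper's read method; a sketch, not my verbatim code.)

    async def read(self, size: int = -1) -> bytes:
        # same buffering and slicing as before, but UTF-8-encode the slice,
        # since the backend expects bytes rather than str
        while size == -1 or len(self.buffer) < size:
            try:
                self.buffer += await anext(self.iterator)
            except StopAsyncIteration:
                break
        if size == -1:
            (result, self.buffer) = (self.buffer, "")
        else:
            (result, self.buffer) = (self.buffer[:size], self.buffer[size:])
        return result.encode("utf-8")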
So long story short: is there any chance that ijson gets the functionality to directly parse from string generators/iterators?