ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

Parsing JSON from string generators/iterators #124

Open doerwalter opened 1 month ago

doerwalter commented 1 month ago

I'm using ijson to parse JSON that is coming out of a langchain (https://www.langchain.com/) pipeline.

This pipeline is basically an async iterator over strings (i.e. str objects, not bytes, which when joined together give the complete JSON output). But as far as I can tell, ijson doesn't support parsing from such a generator directly (which is strange for a library that is based on generators and coroutines).

So to be able to parse that JSON with ijson I have to wrap this generator in a file-like object. I'm doing it like this:

class AsyncIteratorAsFile:
    def __init__(self, iterator):
        self.iterator = iterator
        self.buffer = ""

    async def read(self, size=-1):
        # Accumulate fragments until at least `size` characters are buffered
        # (or the iterator is exhausted; size == -1 means "read everything").
        while size == -1 or len(self.buffer) < size:
            try:
                self.buffer += await anext(self.iterator)
            except StopAsyncIteration:
                break

        if size == -1:
            result, self.buffer = self.buffer, ""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]

        return result

As an example, the pipeline delivers 343 JSON fragments over the course of 34 seconds. I want to show output to the user as soon as the initial JSON snippets arrive (which happens after about one second). This seems to be exactly what ijson was made for.

However when I try it out, nothing gets output until 34 seconds have passed, and then everything gets output at once. My guess would be that this is caused by the default buffer size used by ijson being 64K. And indeed the total size of the JSON output is less than 64K; furthermore, when I use a small buffer size when calling ijson.basic_parse_async() like this:

async for (event, value) in ijson.basic_parse_async(AsyncIteratorAsFile(input), buf_size=10):
    ...

I get the desired effect: partial JSON gets output almost immediately.

However this seems to be inefficient for several reasons:

- I don't really care about the input being parsed in 10-character chunks; I only want parsing to proceed as soon as a fragment arrives.
- The back and forth between Python and C drowns out the performance advantage of the C parser.
- For my example the result was that using the pure Python parser is as fast as using yajl2_c.

I tried to add support for parsing from generators to ijson, but I'm really out of my depth here. What I tried to do was:

diff --git a/src/ijson/__init__.py b/src/ijson/__init__.py
index 9f7c679..95dfb1b 100644
--- a/src/ijson/__init__.py
+++ b/src/ijson/__init__.py
@@ -45,6 +45,7 @@ del _default_backend

 basic_parse = backend.basic_parse
 basic_parse_coro = backend.basic_parse_coro
+basic_parse_gen = backend.basic_parse_gen
 parse = backend.parse
 parse_coro = backend.parse_coro
 items = backend.items
diff --git a/src/ijson/common.py b/src/ijson/common.py
index 99b8bc4..798ed17 100644
--- a/src/ijson/common.py
+++ b/src/ijson/common.py
@@ -325,7 +325,7 @@ def _get_source(source):
 def _make_basic_parse_gen(backend):
     def basic_parse_gen(file_obj, buf_size=64*1024, **config):
         return utils.coros2gen(
-            file_source(file_obj, buf_size=buf_size),
+            file_source(file_obj, buf_size=buf_size) if is_file(file_obj) else file_obj,
             *_basic_parse_pipeline(backend, config)
         )
     return basic_parse_gen

I don't know whether this is the correct approach (the fact that the argument is called file_obj suggests that it isn't). I also don't know what the difference between basic_parse and basic_parse_gen is.

However that nearly gets me there. I get

Traceback (most recent call last):
  File "/Users/walter/x/test_ijson.py", line 92, in <module>
    for x in ijson.basic_parse_gen(g()):
             ~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "/Users/walter/checkouts/ijson/src/ijson/utils.py", line 55, in coros2gen
    f.send(value)
    ~~~~~~^^^^^^^
  File "/Users/walter/checkouts/ijson/src/ijson/backends/python.py", line 36, in utf8_encoder
    sdata = decode(bdata, final)
  File "<frozen codecs>", line 324, in decode
TypeError: can't concat str to bytes

i.e. the parser seems to expect bytes. I can work around that problem by replacing

for x in ijson.basic_parse_gen(g()):
    ...

with

for x in ijson.basic_parse_gen(codecs.iterencode(g(), "utf-8")):
    ...

but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for them to be immediately decoded from UTF-8 again by the parser.

(For the C parser this is still needless inefficiency, but that would be harder to fix, since the C parser would have to directly operate on Python unicode data).


So long story short: Is there any chance that ijson gets the functionality to directly parse from string generators/iterators?

rtobar commented 4 weeks ago

@doerwalter I won't be able to respond in detail now, but in the meantime I'll point you to #44. I'll try to answer more carefully tomorrow.

rtobar commented 2 weeks ago

Thanks @doerwalter for the wait. I haven't had as much time on my hands as I'd like, but I still wrote the message above to signal that I didn't intend to forget about this issue. Today I finally could sit down and write a proper reply. Sorry for the wall of text, but this is a good opportunity to do some brain dumps and somewhat document things I guess.

Firstly, thanks again for the very detailed message, and presenting the problem very clearly.

Long story short, the main issue with supporting generators or iterables as inputs is that it cannot be done in the "generic" routines (see below), as this would break the support ijson currently has for event interception. This feature works based on the premise that users can provide a generator to ijson as an input, with the generator returning 2- or 3-tuples (rather than strings or bytes).
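For reference, a minimal sketch of what event interception looks like (following the usage shown in the README; the sample document and prefix here are made up):

import io
import ijson

def log_events(events):
    # `events` yields (prefix, event, value) 3-tuples as produced by
    # ijson.parse; we can observe them and pass them on unchanged.
    for prefix, event, value in events:
        print("seen:", prefix, event, value)
        yield prefix, event, value

data = io.BytesIO(b'{"results": [{"id": 1}, {"id": 2}]}')

# ijson.items accepts the intercepted 3-tuple generator instead of a file
for obj in ijson.items(log_events(ijson.parse(data)), "results.item"):
    print(obj)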

... So to be able to parse that JSON with ijson I have to wrap this generator in a file like object. I'm doing it like this: [code] ... The read method has to ensure that the strings it returns have the correct size...

Just one comment here: ijson never reads with size==-1 (that's the whole point! :wink:), so if you're using this class with ijson only then you can relax your code. The size argument is also usually a maximum, it doesn't have to be the exact number, and users/libraries (and ijson certainly does) should always check the length of the returned data. It wouldn't change your code much I think anyway.
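A minimal sketch of such a relaxed wrapper, assuming read() is only ever called by ijson with a positive size that acts as an upper bound:

class AsyncIteratorAsFile:
    def __init__(self, iterator):
        self.iterator = iterator
        self.buffer = ""

    async def read(self, size):
        # ijson treats `size` as a maximum and checks the length of what it
        # gets back, so returning fewer characters than requested is fine.
        if not self.buffer:
            # "" as the anext() default signals that the iterator is
            # exhausted, i.e. EOF.
            self.buffer = await anext(self.iterator, "")
        result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result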

... nothing gets output... ... My guess would be that that is caused by the default buffer size used by ijson being 64K... ... However I don't care about the input being parsed in 10-character chunks ... the back and forth between Python and C drowns out the performance advantage of the C parser... ... For my example the result was that using the pure Python parser is as fast as using yajl2_c

You are correct that in your case the buffer size is what's preventing the parsing process from proceeding, and indeed using a smaller buffer size for your workaround is the way to go, even if that creates more back and forth to/from the underlying parser. In any case, yours is a very small JSON document, so it makes sense that the backend you choose makes no humanly noticeable difference (there sure is a difference, but it will be minimal).

About the back and forth: that's exactly what a strings/bytes iterator input would end up doing too: each time you (the user) yield some bytes/str, we'd pass that through the parser. The main difference is that users wouldn't need to pass a buf_size anymore, as each yielded str/bytes can be of a different size, and will be passed to the parser as-is.

I tried to add support for parsing from generators to ijson, but I'm really out of my depth here. What I tried to do was: [code]

That'd be kind of a good first approach, but like I say below there's no way of doing this in the generic entry points. What we need to do is add new specialised entry points.

I also don't know what the difference between basic_parse and basic_parse_gen is

basic_parse_gen only knows how to deal with file-like objects, basic_parse_async only with async file-like objects, and so on. OTOH, basic_parse detects the input type and dispatches to the correct specialised routine, which is why you can do both async for _ in ijson.basic_parse(...) and for _ in ijson.basic_parse(...) from the same routine, depending on the input. The other routines (items, kvitems and parse) follow the same logic.
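A rough sketch of what that dispatching looks like from the user's side (AsyncBytes here is just a toy stand-in for a real async file-like object):

import asyncio
import io

import ijson

data = b'{"answer": 42}'

# Synchronous file-like input: basic_parse yields events like a plain generator.
for event, value in ijson.basic_parse(io.BytesIO(data)):
    print(event, value)

class AsyncBytes:
    # Toy async file-like object wrapping an in-memory buffer.
    def __init__(self, data):
        self.stream = io.BytesIO(data)

    async def read(self, size=-1):
        return self.stream.read(size)

async def main():
    # Asynchronous file-like input: the same entry point returns an
    # async generator instead.
    async for event, value in ijson.basic_parse(AsyncBytes(data)):
        print(event, value)

asyncio.run(main())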

There's no way (that I can see) to add support for str/bytes iterables as inputs to basic_parse (the generic routine) without breaking support for event interception. OTOH, event interception is a somewhat niche feature. I'm not sure whether string iterables as inputs would be more or less popular than event interception, so I'm not initially inclined to break one to support the other.

What can be done however, and I hadn't thought about this during #44 I think, is to add a specialised method that deals only with bytes/string iterables as inputs (e.g., basic_parse_iterable or similar). This would have to be done for all methods. The implementation should be (relatively) trivial. The yajl2_c backend is a bit special and might require a bit more care, but nothing insurmountable.
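Purely hypothetically, usage could then look something like this (basic_parse_iterable doesn't exist yet; the name is only for illustration):

import ijson

def chunks():
    # str/bytes fragments of arbitrary sizes, e.g. from a streaming source
    yield b'{"results": ['
    yield b'{"id": 1}, '
    yield b'{"id": 2}]}'

for event, value in ijson.basic_parse_iterable(chunks()):
    print(event, value)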

the parser seems to expect bytes ... [adds UTF8 encoding] ... but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser

I wouldn't worry about the Python parser (it's miles behind the others in terms of performance). Also, in general, encoding/decoding usually isn't the bottleneck, although it can make a measurable difference (and on principle you still want to be as efficient as possible). But for more reference...

The yajl parsers (and an experimental Boost.JSON backend I baked some time ago) all expect bytes (UTF8-encoded bytes, because that's what JSON should be encoded in), and therefore the library works at its best when users provide bytes, which is also usually what you get from network operations like HTTP traffic, or from reading files -- so it should be a win-win. When users provide file-like sources that return strings on read, we emit a warning and internally UTF8-encode them before passing them to the backend. I'd do something similar for string/bytes iterators: I'd expect them to give bytes preferably, otherwise we'll end up internally encoding them and warning about it.
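If the source can only produce str fragments, a minimal sketch of doing the encoding on the caller's side (before wrapping the iterator in a file-like object) could look like this:

async def encoded(chunks):
    # Turn an async iterator of str fragments into UTF-8 bytes fragments.
    async for chunk in chunks:
        yield chunk.encode("utf-8")

Note that the file-like wrapper from above would then have to buffer bytes instead of str (i.e. start with self.buffer = b"").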

In the case of the library you're using, does it have any option to return the non-decoded bytes instead? The fact that it is returning strings with fragments of a raw JSON document means it is doing some unnecessary extra work to begin with.

Yes, the Python backend internally uses str for parsing. I settled on that solution because, at least at the time, it was (unintuitively) faster: backends always return strings, not bytes, for string objects in the JSON document, and the Python backend was faster when decoding the full JSON document and then extracting those string objects than when dealing with bytes as input and only decoding the string objects that needed to be given to the user. I have had the intention to undo this and let the Python backend deal with bytes internally to remove some complexity, even at the expense of degraded performance (again, nobody should expect the Python backend to be fast). The README even used to say (second bullet point in FAQ #5) that this internal implementation detail might change in the future.

rtobar commented 2 weeks ago

@doerwalter just a final remark about the size argument for read: you can even be laxer, and just return await anext(self.iterator, b"") in your read method. Again, ijson will check the size of what you return and work with that.

doerwalter commented 2 weeks ago

Thanks @doerwalter for the wait. I haven't had as much time on my hands as I'd like, but I still wrote the message above to signal that I didn't intend to forget about this issue. Today I finally could sit down and write a proper reply. Sorry for the wall of text, but this is a good opportunity to do some brain dumps and somewhat document things I guess.

Thanks for the detailed reply, this is really appreciated.

Firstly, thanks again for the very detailed message, and presenting the problem very clearly.

For context, this is the feature request I made in langchain related to ijson: https://github.com/langchain-ai/langchain/discussions/27911

Long story short, the main issue with supporting generators or iterables as inputs is that it cannot be done in the "generic" routines (see below), as this would break the support ijson currently has for event interception. This feature works based on the premise that users can provide a generator to ijson as an input, with the generator returning 2- or 3-tuples (rather than strings or bytes).

OK, so passing a generator to basic_parse() etc. already means something else. In theory ijson could check what comes out of the generator and dispatch on that, but that would be a very messy API.

... So to be able to parse that JSON with ijson I have to wrap this generator in a file like object. I'm doing it like this: [code] ... The read method has to ensure that the strings it returns have the correct size...

Just one comment here: ijson never reads with size==-1 (that's the whole point! 😉), so if you're using this class with ijson only then you can relax your code. The size argument is also usually a maximum, it doesn't have to be the exact number, and users/libraries (and ijson certainly does) should always check the length of the returned data. It wouldn't change your code much I think anyway.

It feels wrong to me to ignore the size argument. And indeed when I replace:

    async def read(self, size: int = -1) -> str:
        while size == -1 or len(self.buffer) < size:
            try:
                self.buffer += await anext(self.iterator)
            except StopAsyncIteration:
                break

        if size == -1:
            (result, self.buffer) = (self.buffer, "")
        else:
            (result, self.buffer) = (self.buffer[:size], self.buffer[size:])

        return result

with

    async def read(self, size: int = -1) -> str:
        return await anext(self.iterator, b"")

my test script hangs.

... nothing gets output... ... My guess would be that that is caused by the default buffer size used by ijson being 64K... ... However I don't care about the input being parsed in 10-character chunks ... the back and forth between Python and C drowns out the performance advantage of the C parser... ... For my example the result was that using the pure Python parser is as fast as using yajl2_c

You are correct that in your case the buffer size is what's preventing the parsing process from proceeding, and indeed using a smaller buffer size for your workaround is the way to go, even if that creates more back and forth to/from the underlying parser. In any case, yours is a very small JSON document, so it makes sense that the backend you choose makes no humanly noticeable difference (there sure is a difference, but it will be minimal).

Using ijson in the (langchain) context is such a big step forward that we probably shouldn't have to worry about the remaining potential performance improvements, but I was hoping that supporting input from generators would be easy to implement, which doesn't seem to be the case.

About the back and forth: that's exactly what a strings/bytes iterator input would end up doing too: each time you (the user) yield some bytes/str, we'd pass that through the parser. The main difference is that users wouldn't need to pass a buf_size anymore, as each yielded str/bytes can be of a different size, and will be passed to the parser as-is.

Exactly. The advantage would be that we would neither have to feed shorter input to the parser (because size was smaller than the chunk we've got), nor have to wait for further input (because size was larger than the chunk we've got).

I tried to add support for parsing from generators to ijson, but I'm really out of my depth here. What I tried to do was: [code]

That'd be kind of a good first approach, but like I say below there's no way of doing this in the generic entry points. What we need to do is add new specialised entry points.

OK.

I also don't know what the difference between basic_parse and basic_parse_gen is

basic_parse_gen only knows how to deal with file-like objects, basic_parse_async only with async file-like objects, and so on. OTOH, basic_parse detects the input type and dispatches to the correct specialised routine

Ah, OK, now the naming scheme makes sense! ;)

, which is why you can do both async for _ in ijson.basic_parse(...) and for _ in ijson.basic_parse(...) from the same routine, depending on the input. The other routines (items, kvitems and parse) follow the same logic.

There's no way (that I can see) to add support for str/bytes iterables as inputs to basic_parse (the generic routine) without breaking support for event interception. OTOH, event interception is a somewhat niche feature. I'm not sure whether string iterables as inputs would be more or less popular than event interception, so I'm not initially inclined to break one to support the other.

What can be done however, and I hadn't thought about this during #44 I think, is to add a specialised method that deals only with bytes/string iterables as inputs (e.g., basic_parse_iterable or similar). This would have to be done for all methods. The implementation should be (relatively) trivial. The yajl2_c backend is a bit special and might require a bit more care, but nothing insurmountable.

Sounds good to me.

the parser seems to expect bytes ... [adds UTF8 encoding] ... but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser

I wouldn't worry about the Python parser (it's miles behind the others in terms of performance). Also, in general, encoding/decoding usually isn't the bottleneck, although it can make a measurable difference (and on principle you still want to be as efficient as possible). But for more reference...

OK, so if we ignore the Python parser, there's really no feasible way around encoding and decoding, since the C parser expects bytes.

The yajl parsers (and an experimental Boost.JSON backend I baked some time ago) all expect bytes (UTF8-encoded bytes, because that's what JSON should be encoded in), and therefore the library works at its best when users provide bytes, which is also usually what you get from network operations like HTTP traffic, or from reading files -- so it should be a win-win. When users provide file-like sources that return strings on read, we emit a warning and internally UTF8-encode them before passing them to the backend. I'd do something similar for string/bytes iterators: I'd expect them to give bytes preferably, otherwise we'll end up internally encoding them and warning about it.

This sounds like a useful approach, except that we could drop the warning, and document the support for mixed str/bytes as a feature. ;)

In the case of the library you're using, does it have any option to return the non-decoded bytes instead? The fact that it is returning strings with fragments of a raw JSON document means it is doing some unnecessary extra work to begin with.

Well, the type annotations for the method I'm overriding in langchain's JSON parser suggest that that's not possible:

    def _transform(self, input: Iterator[Union[str, BaseMessage]]) -> Iterator[Any]:
        ...

class BaseMessage(Serializable):
    content: Union[str, list[Union[str, dict]]]
    """The string contents of the message."""

(see https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/output_parsers/transform.py#L111 and https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/messages/base.py#L23)