ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

Expose push / sax-like interface #22

Closed ntninja closed 4 years ago

ntninja commented 4 years ago

Dealing with asyncio streams I was wondering how realistic it would be to make this awesome library into something that accepts a push pattern.

Thinking about something like:

import ijson

def on_event(event, value):
    ...  # Do something

async def collect_json(source):
    parser = ijson.Parser(on_event)
    async for chunk in source:
        parser.feed(chunk)
    parser.feed_eof()

The current system only supports blocking I/O, and there is unfortunately no way to emulate the above without resorting to threads and pipes/queues.
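To make the workaround concrete, here is a hedged sketch of the threads-and-queues approach alluded to above: a blocking pull-style generator is driven in an executor thread, and its events are forwarded into asyncio through a queue. `blocking_events` is a stand-in placeholder for any blocking parser, not ijson's API.

```python
import asyncio

def blocking_events(data):
    # Placeholder for a blocking pull-style parser (not ijson's API).
    for token in data.split():
        yield ("token", token)

async def collect(data):
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    _DONE = object()  # sentinel marking the end of the event stream

    def producer():
        # Runs in a worker thread: pull events from the blocking parser
        # and hand them to the event loop in a thread-safe way.
        for event in blocking_events(data):
            loop.call_soon_threadsafe(queue.put_nowait, event)
        loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    task = loop.run_in_executor(None, producer)
    results = []
    while (item := await queue.get()) is not _DONE:
        results.append(item)
    await task  # propagate any exception from the producer thread
    return results

events = asyncio.run(collect("a b c"))
# events == [("token", "a"), ("token", "b"), ("token", "c")]
```

This works, but it burns a thread per stream and adds queue-hopping latency, which is exactly why a native push interface is preferable.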

rtobar commented 4 years ago

@alexander255 I have been working on a set of changes implementing this, but they are not fully fleshed out yet, so I haven't made them publicly available for testing.

The main idea is to decouple I/O from the parsing logic, handling only buffers of data as input, and to have these new methods be the building blocks of the rest of the library. I am doing this using coroutines (as in "generators with (yield) expressions", not the python3.5+-specific async coroutines) which expect data to be sent to them, work on it, and then send their results to another coroutine. These coroutines can effectively be chained together, and also allow for user-provided targets. So, all in all, the current pull pattern gets inverted into a push pattern at the low level. The current set of "pull" generators (basic_parse, parse and items) is then easily built on top of that.
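A minimal sketch of this chaining idea, using plain generators with (yield) expressions; these names are illustrative only and are not ijson's actual internals:

```python
def sink(collected):
    """Final target coroutine: stores every value sent to it."""
    while True:
        collected.append((yield))

def splitter(target):
    """Intermediate coroutine: splits incoming chunks into words and
    pushes each one to the next coroutine in the chain."""
    while True:
        chunk = (yield)
        for word in chunk.split():
            target.send(word)

collected = []
tail = sink(collected)
next(tail)            # prime the coroutine up to its first yield
head = splitter(tail)
next(head)

head.send("one two")  # the caller *pushes* data into the chain
head.send("three")
# collected == ["one", "two", "three"]
```

The caller drives the pipeline by sending buffers in, which is the inversion of control described above: data is pushed through the stages rather than pulled by a reader.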

All of this is working fine for all backends with no drop in performance, with the sole exception of the C backend, which I haven't finished porting yet (but it's about 50% done). Apart from that, I also have a first version of async for-enabled support functions (basic_parse_async and so on) for all backends, which will hopefully make it easier to use ijson with async I/O. Other constructs like the one you suggest can easily be added as well with this new framework.

I expect all these changes will become ijson 3.0, and hopefully it won't take much longer to finish.

ntninja commented 4 years ago

@rtobar: That sounds great!

(The interface in my example was just an illustration to get the general idea across; what you propose is much better thought out anyway.)

rtobar commented 4 years ago

@alexander255 the very first version of this work is finally available. If you could, would you mind trying out the code in the new coros branch? It should have new *_async methods when using Python 3.5+, which receive an asyncio file-like object and can be used in an async for construct. I haven't updated the README yet with how to use the new methods, but you can see the tests_asyncio.py file for simple examples.

rtobar commented 4 years ago

New interfaces are now exposed on the master branch. Have a look at the README file to learn more about this. There is a release candidate up on PyPI too (3.0rc1) with these changes, so people can go ahead and test them and provide feedback before the final 3.0 release is done.

ntninja commented 4 years ago

Sorry for only looking at this now. While I haven't used it yet, the offered push interface will definitely work for our needs, so thank you for that!

Some feedback: it looks like the stream-reading code of the *_async variants is specific to asyncio's StreamReader interface, though (it uses its definition of .read() to grab more data). Unfortunately, our library exclusively uses trio, which has a slightly different interface, so I cannot reuse this code although it would otherwise be a perfect fit for what we're trying to do. As such, could I convince you to change the line:

data = await self.f.read(self.buf_size)

in utils35.py to this:

if hasattr(self.f, "receive_some"):  # Trio's ReceiveStream
    data = await self.f.receive_some(self.buf_size)
else:  # AsyncIO's StreamReader + Trio's Async Files + CurIO's SocketStream + CurIO's FileStream
    data = await self.f.read(self.buf_size)

I know it's dumb to have exactly one outlier in that list, but it's just the way things are right now… The difference in interface does make some sense when viewed purely from the perspective of what the interface represents within the trio system (some random thing that gives you data). From the PoV of somebody trying to be ecosystem-agnostic it is of course a totally unnecessary complication, but since they only had (and have) trio consumers in mind, I think their decision was by itself quite understandable.

rtobar commented 4 years ago

@alexander255: let me ask the inverse question: would it be possible for you to alter your trio objects so they have a read method that behaves like receive_some?

f = the_trio_file_type()
f.read = f.receive_some
async for object in ijson.items_async(f, ''):
    pass

If it's not possible to add members to trio objects, a thin wrapper would also do.
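Such a thin wrapper could look like the sketch below. The names are hypothetical, and `FakeReceiveStream` merely stands in for a real trio ReceiveStream so the example is self-contained:

```python
import asyncio

class ReceiveStreamAdapter:
    """Adapts an object exposing an async receive_some(max_bytes) method
    (like trio's ReceiveStream) to the read(size) interface that the
    *_async functions expect.  Name and shape are illustrative."""

    def __init__(self, stream):
        self._stream = stream

    async def read(self, size):
        return await self._stream.receive_some(size)

class FakeReceiveStream:
    """Stand-in for a trio ReceiveStream, so this sketch is runnable
    without trio installed."""

    def __init__(self, data):
        self._data = data

    async def receive_some(self, max_bytes):
        chunk, self._data = self._data[:max_bytes], self._data[max_bytes:]
        return chunk

async def main():
    f = ReceiveStreamAdapter(FakeReceiveStream(b'{"a": 1}'))
    chunks = []
    while chunk := await f.read(4):   # empty bytes signal end of stream
        chunks.append(chunk)
    return b"".join(chunks)

data = asyncio.run(main())
# data == b'{"a": 1}'
```

With an adapter like this, the wrapped stream could presumably be passed wherever an asyncio StreamReader-like object is expected, without ijson itself needing trio-specific branches.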

On the other hand, I'm the first to admit that I'm not very well versed in the asyncio ecosystem, so I'm not sure what's the most popular thing out there. If it turns out there is a considerable number of users who have receive_some-like objects it would make sense to add the extra support, but for now I think I'll refrain from doing so. As you point out, having one exception is not great if we want to be agnostic to whatever people are using, and adding one exception would also set a precedent for adding more, which again is not great.

ntninja commented 4 years ago

@rtobar: Obviously I can wrap these up, but it's just not the same as having it just work without any further ado. I don't blame you for that decision though; I'd probably have done the same in your stead. :slightly_smiling_face:

Let me just request one tiny amendment, however: could you rename the *_async methods to *_asyncio to make it clear which AIO framework they target? Just to avoid surprises, etc.

PS: Regarding AIO frameworks, I think it's very unlikely that we are going to see more of them anytime soon. asyncio is, and will be for a long time, the go-to solution for AIO in Python. trio is the more innovative hidden gem that tries new things. curio has mostly been superseded by trio at this point, save for cases where one has very stringent performance needs (but still uses Python for some reason).

rtobar commented 4 years ago

@alexander255 the new name is a very good suggestion, and easy to get done. I'll do it for the next rc. Thanks for the clarification on frameworks too; one day I might need to delve deeper into that world.

rtobar commented 4 years ago

Mmmm... actually, thinking about it again, I'm not sure anymore... The async suffix already matches the fact that these methods have to be used in async def functions, and most usually inside async for statements, so having them end with asyncio would break that. I tried looking for examples of libraries that offer both flavours of their methods, but a quick search didn't yield relevant results.

ntninja commented 4 years ago

@rtobar: All the AIO frameworks use async def functions; it's the core primitive they all depend on. My suggestion of the _asyncio suffix is based on the fact that your code does not accept just any async def function/iterable, but rather the specific data-stream interface exposed by the asyncio library. That doesn't mean it can't work with other things as well (I've brought up plenty of examples above), but that is the API it targets and is guaranteed to be compatible with.

rtobar commented 4 years ago

@alexander255 yes, I understand that while there are multiple frameworks/APIs, all of them share the same underlying mechanism for declaring coroutines (async def), and that the new methods are designed to work in particular with asyncio-like objects.

The point I was trying to make was different: XYZ_asyncio would mean "this is the version of XYZ that works with asyncio (hence it's asynchronous)", but the meaning I want to convey with the XYZ_async naming is "this is the asynchronous version of XYZ", not a reflection of the type of input it accepts. The original XYZ names don't actually hint at what type of object they work with either; even though they work with objects behaving like the classes in the io module, they are not called XYZ_io.

On top of that, in both the XYZ and XYZ_async cases I've also been thinking of adding support for accepting generators too (sync and async, respectively), which is yet another reason not to bind the particular type of input into the names.