Closed ntninja closed 4 years ago
@alexander255 I have been working on a set of changes implementing this, but they are not fully fleshed out, so I haven't made them available for public testing yet.
The main idea is to decouple I/O from the parsing logic, handling only buffers of data as input, and to make these new methods the building blocks of the rest of the library. I am doing this using coroutines (as in "generators with `(yield)` expressions", not the python 3.5+-specific `async` coroutines) which expect data to be sent to them, work on it, and then send their results to another coroutine. These coroutines can effectively be chained together, and also allow for user-provided targets. So all in all the current pull pattern gets inverted into a push pattern at the low level. The current set of "pull" generators (`basic_parse`, `parse` and `items`) is then easily built on top of that.
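The push pattern described here can be sketched with plain generator-based coroutines. This is an illustrative toy (a whitespace tokenizer, not ijson's actual internals): each coroutine receives data via `.send()` and pushes results to the next stage.

```python
def tokenizer(target):
    """Receives raw text chunks via .send() and pushes complete
    whitespace-separated tokens to the next coroutine in the chain."""
    buf = ''
    while True:
        chunk = (yield)
        buf += chunk
        *tokens, buf = buf.split(' ')  # keep the (possibly partial) tail
        for token in tokens:
            if token:
                target.send(token)

def collector(results):
    """Terminal coroutine: appends every token it receives to `results`."""
    while True:
        results.append((yield))

results = []
sink = collector(results)
next(sink)                # prime the coroutine so it reaches its first yield
chain = tokenizer(sink)
next(chain)

# The caller now *pushes* data in, instead of the library pulling from a file:
chain.send('hello wor')
chain.send('ld and more ')
# results == ['hello', 'world', 'and', 'more']
```

Note how the token split across the two chunks (`'wor'` + `'ld'`) is reassembled by the buffering stage; the pull-style generators can then be layered on top by driving such a chain from a file-reading loop.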
All of this is working fine for all backends with no drop in performance, with the sole exception of the C backend, which I haven't finished porting yet (it's about 50% done). Apart from that, I also have a first version of `async for`-enabled support functions (`basic_parse_async` and so on) for all backends, which will hopefully make it easier to use ijson with async I/O. Other constructs like the one you suggest can easily be added as well with this new framework.
I expect all these changes will become ijson 3.0, and hopefully it won't take much longer to finish.
@rtobar: That sounds great!
(The interface in my example was just an illustration to get the general idea across, but what you proposed is much better thought out anyway.)
@alexander255 the very first version of this work is finally available. If you could, would you mind trying out the code in the new `coros` branch? It should have new `*_async` methods when using python 3.5+, which receive an asyncio file-like object and can be used in an `async for` construct. I haven't updated the README yet with how to use the new methods, but you can see the `tests_asyncio.py` file for simple examples.
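The general shape of such an `*_async` method can be sketched with only the standard library. This is an illustrative stand-in, not ijson's real code: an async generator that repeatedly awaits `.read()` on an asyncio file-like object and yields the buffers it gets back, so callers can consume it with `async for`.

```python
import asyncio

async def chunks_async(f, buf_size=64 * 1024):
    """Async generator over raw buffers read from an asyncio stream.
    A real *_async method would feed these buffers into the push
    parser and yield parse events instead of raw bytes."""
    while True:
        data = await f.read(buf_size)
        if not data:  # EOF
            break
        yield data

async def main():
    # asyncio.StreamReader is a stdlib stand-in for a real network stream
    reader = asyncio.StreamReader()
    reader.feed_data(b'{"answer": 42}')
    reader.feed_eof()
    received = []
    async for chunk in chunks_async(reader):
        received.append(chunk)
    return b''.join(received)

data = asyncio.run(main())
# data == b'{"answer": 42}'
```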
New interfaces are now exposed on the `master` branch. Have a look at the README file to learn more about this. There is a release candidate up on PyPI too (3.0rc1) with these changes, so people can go ahead and test them and provide feedback before the final 3.0 release is done.
Sorry about only looking at this now. While I haven't used it yet, the offered push interface will definitely work for our needs, so thank you for that!
Some feedback: it looks like the stream-reading code of the `*_async` variants is specific to asyncio's StreamReader interface though (they use its definition of `.read()` to grab more data).
Unfortunately our library exclusively uses trio, which has a slightly different interface, so I cannot reuse this code, although it would otherwise be a perfect fit for what we're trying to do. As such, could I convince you to change the line:
```python
data = await self.f.read(self.buf_size)
```
in utils35.py to this:
```python
if hasattr(self.f, "receive_some"):  # Trio's ReceiveStream
    data = await self.f.receive_some(self.buf_size)
else:  # AsyncIO's StreamReader + Trio's async files + CurIO's SocketStream + CurIO's FileStream
    data = await self.f.read(self.buf_size)
```
I know it's dumb to have exactly one outlier in that list, but it's just the way things are right now… The difference in interface does make some sense when you view it purely from the perspective of what the interface represents within the trio system (some random thing that gives you data). From the PoV of somebody trying to be ecosystem-agnostic it's of course a totally unnecessary complication, but when you (as they did/do) only have trio consumers in mind, I think their decision was by itself quite understandable.
@alexander255: let me ask the inverse question: would it be possible for you to alter your trio objects so they have a `read` method that behaves like `receive_some`?
```python
f = the_trio_file_type()
f.read = f.receive_some
async for object in ijson.items_async(f, ''):
    pass
```
If it's not possible to add members to trio objects, a thin wrapper would also do.
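Such a thin wrapper could look like this (a hypothetical sketch, not part of ijson or trio): it gives any object exposing trio's `receive_some()` the `read()` method the `*_async` functions expect.

```python
class ReceiveStreamAdapter:
    """Duck-types trio's ReceiveStream as an asyncio-style reader."""

    def __init__(self, stream):
        self._stream = stream

    async def read(self, size):
        # Delegate to trio's interface; receive_some may return fewer
        # bytes than requested, which read() callers already tolerate.
        return await self._stream.receive_some(size)

# Hypothetical usage with a trio stream:
#   async for obj in ijson.items_async(ReceiveStreamAdapter(stream), ''):
#       ...
```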
On the other hand, I'm the first to admit that I'm not very well versed in the asyncio ecosystem, so I'm not sure what's the most popular thing out there. If it turns out there is a considerable number of users who have `receive_some`-like objects it would make sense to add the extra support, but for now I think I'll refrain from doing so. As you point out, having one exception is not great if we want to be agnostic to whatever people are using. And adding one exception would also set a precedent for adding more, which again is not great.
@rtobar: Obviously I can wrap these up, but it's just not the same as having it just work without any further ado. I don't blame you for that decision though; I'd probably have done the same in your stead. :slightly_smiling_face:
Let me just request one tiny amendment however: could you rename the `*_async` methods to `*_asyncio` to make it clear which AIO framework they target? Just to avoid surprises, etc.
PS: Regarding AIO frameworks, I think it's very unlikely that we are going to see more of them anytime soon. asyncio is, and will be for a long time, the go-to solution for AIO in Python. trio is the more innovative hidden gem that tries new things. curio has mostly been superseded by trio at this point, save for cases where one has very stringent performance needs (but still uses Python for some reason).
@alexander255 the new name is a very good suggestion, and easy to get done. I'll do it for the next rc. Thanks for the clarification on frameworks too, one day I might need to delve deeper into that world.
Mmmm... actually, thinking about it again, I'm not sure anymore... the `async` suffix already matches the fact that they have to be used in `async def` functions, and usually inside `async for` statements, so having them end with `asyncio` would break that. I tried looking for examples of libraries that offer the two flavours of their methods, but a quick search didn't yield relevant results.
@rtobar: All the AIO frameworks use `async def` functions – it's the core primitive they all depend on. My suggestion of using the `_asyncio` suffix is based on the fact that your code does not expect just any `async def` function/iterable, but rather the specific data-stream interface exposed by the asyncio library. That doesn't mean it can't work with other stuff as well, I've brought up plenty of examples above, but that is the API it is targeting and guaranteed to be compatible with.
@alexander255 yes, I understand that while there are multiple frameworks/APIs, all of them share the same underlying mechanism for declaring coroutines (`async def`), and that the new methods are designed to work in particular with asyncio-like objects.
The point I was trying to make was different: `XYZ_asyncio` would mean "this is the version of XYZ that works with asyncio (hence it's asynchronous)", but the meaning I want to convey with the `XYZ_async` naming is "this is the asynchronous version of XYZ", not a reflection of the type of input they accept. The original "XYZ" names actually don't hint at what type of object they work with, and even though they work with objects behaving like the classes in the `io` module they are not called `XYZ_io`.
On top of that, in both the `XYZ` and `XYZ_async` cases, I've also been thinking about adding support for accepting generators too (sync and async, respectively), yet another reason for not binding the particular type of input into the names.
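Generator input could be supported with a small adapter. This is purely a hypothetical sketch (not ijson's implementation): it wraps a generator of byte chunks in the minimal `.read()` interface the sync methods already consume.

```python
class GeneratorReader:
    """Wraps a generator of byte chunks in a file-like .read() interface.
    Hypothetical sketch of how generator input could be accepted."""

    def __init__(self, gen):
        self._gen = gen
        self._buf = b''

    def read(self, size):
        # Pull chunks from the generator until we can satisfy the request
        # (or the generator is exhausted), buffering any excess.
        while len(self._buf) < size:
            try:
                self._buf += next(self._gen)
            except StopIteration:
                break
        data, self._buf = self._buf[:size], self._buf[size:]
        return data

def chunks():
    yield b'{"a": '
    yield b'1}'

r = GeneratorReader(chunks())
# r.read(4) == b'{"a"'; subsequent reads drain the rest, then return b''.
```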
Dealing with asyncio streams, I was wondering how realistic it would be to turn this awesome library into something that accepts a push pattern.
Thinking about something like:
The current system only supports blocking I/O, and there is unfortunately no way to emulate the above without using threads and pipes/queues.