asyncdef / aitertools

Async versions of the Python itertools features.
Apache License 2.0

Comparing my take #1

Closed ryanhiebert closed 6 years ago

ryanhiebert commented 8 years ago

I started sketching out my own version of async itertools before I realized this existed, and kept going partly out of NIH and partly out of a desire to learn.

https://github.com/ryanhiebert/aitertools/blob/master/src/aitertools.py

I'm curious what you think of my approach, especially with regards to my coroutine_iterator.

I felt that these functions should not try to guess whether their arguments are async or not, so I instead added a couple of helper functions, to_aiter and to_async, to wrap their synchronous counterparts. This is different from yours, where the arguments are allowed to be regular functions and iterables but can still be used as if they were asynchronous. I think keeping them separate forces the user to understand the difference more, and the split is challenging enough that enforcing that separation seems like a good idea to me.
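
For illustration, the wrappers are conceptually something like this (a simplified sketch using an async generator, not the exact code in the linked module):

def to_aiter(iterable):
    # Wrap a plain, synchronous iterable as an async iterator.
    async def gen():
        for item in iterable:
            yield item
    return gen()

def to_async(func):
    # Wrap a plain callable so it can be awaited like a coroutine function.
    async def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper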

Because I can see people wanting to use both itertools and aitertools in the same module, I chose to prefix all of the functions with a, except for functions that are only in aitertools. This somewhat mirrors how itertools named its version of zip izip to avoid overriding the builtin.

So far what I have is tested, but not documented. I'm not sure if our ideas can be brought together to join these into one, but I thought I'd start the conversation, and see where it goes.

kevinconway commented 8 years ago

I really appreciate you reaching out and taking a look at the code. I've tried to put together some thoughts here in answer to your questions:

I think keeping them separate forces the user to understand the difference more, and the split is challenging enough that enforcing that separation seems like a good idea to me.

Ideologically, I completely agree with having a clear and explicit separation of sync and async code. I think it makes it clear to the developer that the code will CPU block instead of yielding, and that is an important detail to make apparent. I wrote this version of aitertools to adapt sync iterators because I found the explicit separation to be unreasonably verbose when interacting with 3rd party code. Some history:

I originally wrote this to help facilitate another project where I was adapting Werkzeug to async. It's not difficult to adapt an asyncio HTTP server to fit something close to the WSGI protocol. The major difference is that the input and error streams become async streams instead of sync ones. The goal of the other project was to subclass all the components in Werkzeug that touched the stream and maintain a set of async extensions for that library.

Sync streams in Python implement the iterable/iterator protocols and Werkzeug heavily leverages that interface. Depending on which optimization is being used internally, a call to a Werkzeug function that touches the input stream may return a tuple, list, file-like object, stream, or itertools object built from any of those other objects. The number of if statements and type checks required to accurately determine exactly which kind of iterable is being passed around was exhausting and brittle. Eventually, I decided to adapt all emitted iterables to async iterables so that consuming code could standardize on that interface and let the producing code handle the lesser cost of adapting its output.

Because I can see people wanting to use both itertools and aitertools in the same module, I chose to prefix all of the functions with a

I think this is a simple difference in style. I chose not to prefix with 'a' so the module's interface was closer to that of itertools. The visual distinction comes from accessing via the module (aitertools.chain vs itertools.chain). I don't think there's any way to say that one is better than the other; just two ways to achieve the same thing.

So far what I have is tested

If you haven't already, I would recommend pulling additional test cases from the standard lib itertools test suite. I've been porting them over as I implement each feature to make sure it produces the same output as the standard lib.

I'm not sure if our ideas can be brought together to join these into one

I haven't had a chance to do a code review of your project yet. I'll post another blurb when I do and we can spitball some ideas.

ryanhiebert commented 8 years ago

Thank you so much for replying, I appreciate your willingness to start a conversation. Open source work is pretty thankless at times, so I try not to expect it of others.

The Werkzeug / Async WSGI project sounds pretty neat. Though I'm not sure I'd build something new taking WSGI as a pattern to follow ;-).

I agree that producing async iterables everywhere is the right way to do things. It certainly makes the interface easier. It's not clear to me if you believe this has strengthened your choice to make aitertools take normal iterables as arguments or diminished it; I think it would diminish it in my mind.

I think this is a simple difference in style. I chose not to prefix with 'a' so the module's interface was closer to that of itertools. The visual distinction comes from accessing via the module (aitertools.chain vs itertools.chain). I don't think there's any way to say that one is better than the other; just two ways to achieve the same thing.

Agreed. I tend to prefer importing the functions from itertools so that they aren't namespaced when I use them, and I figured I'd like to do the same thing for aitertools. Writing the tests, where I've only imported aitertools, has confirmed that I much prefer not having to prefix everything with aitertools.

If you haven't already, I would recommend pulling additional test cases from the standard lib itertools test suite. I've been porting them over as I implement each feature to make sure it produces the same output as the standard lib.

I hadn't, but thank you for the pointer. I'll definitely have to look at those further, because it makes it clear what edge-cases they felt were necessary. I do want to note that I don't really care to produce the same output as the normal itertools module, if it's not convenient or necessary. I wrote a backport of the csv module, and it was a pain to get all the error messages to line up, etc. I think that aitertools, while taking heavy inspiration from itertools, is distinct enough to warrant doing things differently if it's helpful. There will be new functions that really only apply to async iterators, such as async iterator scheduling.

One other thing that I'd like your take on is my use of asyncio (especially asyncio.gather) in these functions. It seems like it would be less than ideal to tie these functions to asyncio over any alternative event loop, but I'm not sure how to get the benefit of gather while still being event-loop agnostic. I hadn't initially thought of that issue, which is why I wrote it the way I did, but now that I see it I'm wondering if you have any suggestions for addressing it.

kevinconway commented 8 years ago

Though I'm not sure I'd build something new taking WSGI as a pattern to follow

Trying to build out an aWSGI stack was an experiment to see how much effort it would take to adapt tools that everyone already knows and loves, like Werkzeug and Flask, to asyncio. I had hoped that I could come up with a set of simple principles by which I could refactor existing WSGI oriented code and make it asyncio compatible. I still think it can be done, but trying to accomplish it in Werkzeug was like fighting a hydra. Every time I found a stream manipulation and async'ed it I would find myself stuck searching down every consuming function and async'ing it; lather, rinse, repeat until the repo starts looking like an entire fork rather than an extension. At this point I'm more inclined to focus on porting Twisted code and leaving sync style stuff behind.

I agree that producing async iterables everywhere is the right way to do things. It certainly makes the interface easier. It's not clear to me if you believe this has strengthened your choice to make aitertools take normal iterables as arguments or diminished it; I think it would diminish it in my mind.

I'm still somewhat conflicted. For large data sets in sync structures, like tuples or lists, that get adapted, the blocking hit from async for could be very unexpected. I would prefer that the distinction in code be obvious. However, I haven't found a good way to solve combining sync and async iterables other than adapting them both to the async interface. A common pattern I saw in Werkzeug stream management, for example, was a non-lossy read ahead on the stream. The function would read a line from the stream, trigger some behaviour based on the content, and return a chain of the line and the original stream.
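
As a rough illustration of that pattern (made-up names, and assuming to_aiter/chain style helpers like the ones we've both written):

async def read_ahead(stream):
    # Read one line ahead of the consumer.
    first = await stream.readline()  # assumes an async stream API
    if first.startswith(b'Content-Type'):
        pass  # trigger some behaviour based on the content
    # Non-lossy: chain the consumed line back in front of the rest of the stream.
    return chain(to_aiter([first]), stream)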

I also question whether I should lose any sleep at all over having CPU blocking async iterables. The behaviour of native coroutines with await and generator coroutines with yield from is virtually identical. Having the word async in the mix doesn't actually guarantee async, it only offers it as a possibility. This is a big source of confusion and contention on the mailing list and #python. I don't believe the technology provides for an expectation of asynchronicity beyond the connotation of the keyword the core developers chose to identify native coroutine functions.

your take on is my use of asyncio (especially asyncio.gather)

I had a chance to look over your project. Other than how we chose to manage state (class vs closure), some of the implementations are identical. Hopefully that means we're both "doing it right". As far as asyncio, I share your concern with leveraging it. I'm not sure how many other stacks will spin up to compete with asyncio or how many of them will be able to gain enough mass to attract developers away from the standard library. Even so, I think being framework agnostic is a good thing.

Regarding asyncio.gather, I can see the convenience of it but I think the value add is minimal compared to the cost of binding to asyncio. I think you could easily refactor that code to not leverage it if you wanted to:

@coroutine_iterator
async def azip(*aiterables):
    # Resolve every async iterable into an async iterator, concurrently.
    aiters = await asyncio.gather(*map(aiter, aiterables))
    async def __anext__():
        # Advance all of the iterators concurrently and collect the values.
        return tuple(await asyncio.gather(*map(anext, aiters)))
    return __anext__

@coroutine_iterator
async def azip2(*aiterables):
    # Same shape, but without asyncio: resolve the iterators one at a time.
    aiters = tuple([await aiter(aiterable) for aiterable in aiterables])
    async def __anext__():
        # Advance each iterator sequentially instead of fanning out.
        return tuple([await anext(it) for it in aiters])
    return __anext__

I didn't run that code through a PY35 interpreter to make sure it is valid. Just wanted to help with an example.

ryanhiebert commented 8 years ago

I've been following the async / yield discussions going on in Rust, which led me to coroutines in C++, and one interesting thing that it seems like they're doing (though I'm not 100% sure I've understood correctly) is that the compiler can notice that a coroutine is being used synchronously and eliminate the overhead of the coroutine. Very cool stuff, and I haven't figured out how it all works yet, but there seems to be some exciting stuff there for avoiding this obnoxious splitting of the world / colored functions.

AFA using tuple() instead of asyncio.gather, the problem is that it's only going to work on one value at a time, even though all of those items could be resolved simultaneously (for instance if they are network requests). Even though the run loop could do some other stuff, this function wouldn't be able to benefit from it at all, which defeats the purpose.
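
Here's a toy comparison to make that concrete (using asyncio.sleep as a stand-in for a network request):

import asyncio

async def fetch(i):
    await asyncio.sleep(1)  # stand-in for a network request
    return i

async def one_at_a_time():
    # Sequential awaits: each value resolves before the next starts, ~3 seconds.
    return tuple([await fetch(i) for i in range(3)])

async def fanned_out():
    # asyncio.gather lets the run loop interleave them, ~1 second total.
    return tuple(await asyncio.gather(*(fetch(i) for i in range(3))))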

It was an interesting realization to understand that all this is based on generators. Even though I already knew that, it made it clear that the default mode is still single threaded, and not parallel at all. The schedule function I have there is nothing more than zip (which uses gather for each set), discarding the scheduler's values. Because nothing will move on until the next iteration comes, it can effectively be used to slow things down, or to control the timing. It's both expected, and in some fundamental ways, not.
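
Roughly, the idea looks like this (a hand-rolled sketch with made-up names, not my actual implementation):

import asyncio

async def ticker(delay):
    # The "scheduler": an async iterator that produces a value every `delay` seconds.
    while True:
        await asyncio.sleep(delay)
        yield None

async def schedule(scheduler, source):
    # Pair the scheduler with the source and discard the scheduler's values,
    # so the source advances at most once per tick.
    ticks = scheduler.__aiter__()
    items = source.__aiter__()
    while True:
        try:
            await ticks.__anext__()         # wait for the next tick
            yield await items.__anext__()   # then emit the next item
        except StopAsyncIteration:
            return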

Now that I've worked with async a bit, I find myself liking the rejected codef PEP, because it allowed a single object to implement both a normal callable interface, as well as a coroutine calling interface. It was a very pretty design for making the colored functions problem less problematic. It had some places where it seemed strange too, but that polymorphism was a really cool idea.

kevinconway commented 8 years ago

AFA using tuple() instead of asyncio.gather, the problem is that it's only going to work on one value at a time

This is something I hadn't given much thought to. Now that I have it makes me even more disappointed in asyncio.

A few years ago I worked on a series of JavaScript projects similar to this one. Having the event loop integrated into the runtime and having the language standardized on error first callbacks meant I could support sequential and fan out forms of evaluation in all environments without any additional work. Having the event loop exist in user space for Python doesn't make it technically impossible to do the same thing, but I believe the asyncio implementation does. Specifically, the asyncio Future implementation appears to make this an unreasonable feat.

I'd be interested in hearing your thoughts on this. I think the only way to make the fan out pattern work would be to either 1) dedicate the library to a specific event loop or 2) implement a lightweight coroutine manager similar to the asyncio Task embedded in the project to allow for concurrent coroutine pumping for cases like zip. This secondary coroutine manager, however, would need to both consume and emit some kind of Future/Promise/Deferred. This is where the asyncio implementation becomes an issue.

I haven't seen the commit history or discussion around why this decision was made, but asyncio does not use a Future implementation that is API compliant with concurrent.futures. If all event loops in Python that supported coroutines were required to use concurrent.futures.Future then there wouldn't be an issue. Any coroutine manager could handle and emit Future objects that were compatible with all other coroutine managers. The asyncio project immediately breaks interoperability with anything not using the asyncio Future because it 1) violates the Future interface and 2) is tightly coupled with the event loop by shoving callbacks onto the loop's call queue rather than executing them in band.
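
As a concrete example of the mismatch, as I understand the two APIs:

import asyncio
import concurrent.futures

def sync_future_demo():
    f = concurrent.futures.Future()
    # result() blocks (up to an optional timeout) waiting for the value.
    try:
        f.result(timeout=0.1)
    except concurrent.futures.TimeoutError:
        pass

async def async_future_demo():
    f = asyncio.get_running_loop().create_future()
    # No timeout parameter at all, and it never blocks: a pending future
    # raises InvalidStateError immediately, and its done callbacks are
    # dispatched through loop.call_soon() rather than run in band.
    try:
        f.result()
    except asyncio.InvalidStateError:
        pass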

The only other option I can think of is to make fan out an optional feature that is only supported if the developer can provide an implementation of gather. The default, event loop agnostic behaviour would be sequential resolution of values. If given an async def as an optional argument, zip could switch to using it instead. The assumption is that any given implementation is compatible with the event loop under which the tools are being used, which I think is a very reasonable assumption to make.
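
A sketch of that shape, with made-up names and untested like the snippet above:

async def _sequential_gather(*awaitables):
    # Event loop agnostic default: resolve the values one at a time.
    return [await a for a in awaitables]

class azip:
    # zip over async iterables; fan out only when a gather implementation
    # (e.g. asyncio.gather) is supplied by the caller.
    def __init__(self, *aiterables, gather=_sequential_gather):
        self._aiters = [a.__aiter__() for a in aiterables]
        self._gather = gather

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._aiters:
            raise StopAsyncIteration
        # Sequential by default; concurrent if gather=asyncio.gather was given.
        # A StopAsyncIteration from any iterator propagates and ends the zip.
        return tuple(await self._gather(*(it.__anext__() for it in self._aiters)))

Usage would then be something like async for pair in azip(a, b, gather=asyncio.gather).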

ryanhiebert commented 8 years ago

The problems you have are inherent in the low-overhead approach that Python has taken in its asyncio implementation. I actually find it to be quite innovative, though it also is more low-level, and sometimes more annoying. For instance, there's no way that a systems language like Rust would be okay with adding a universal run loop for async. In this way, Python's model is more portable.

It means that users have the flexibility of choosing a run loop implementation, and sometimes that flexibility comes with trade-offs. I do think they made the right trade-offs, but it's also clear that async programming is really hard, even in the best case.

https://groups.google.com/forum/#!topic/python-tulip/1CbjmZTXINM

That doesn't give much in the way of reasoning behind the difference between concurrent.futures and asyncio futures, but it does hint at it: it shouldn't require threads. In JavaScript, the bottom of the concurrency stack relies on magical functions, provided by the engine, that run stuff on another (invisible) thread and then make a callback to the main thread's event loop. That's the magic that Python doesn't want.

One thing that I've recently figured out, and I think is cool, is that using the await keyword doesn't necessarily yield anything back to the run loop. That's because await is based on yield from, which gets encapsulated in a generator. The generator calling code doesn't know that it even went into a sub-generator.

So if we have these two async functions:

async def a():
    return await b()

async def b():
    return 'Hello World!'

Once the final generator (a().__await__()) is started using next(), it'll raise StopIteration('Hello World!'). It's obvious now that I see it, but I assumed that anything with the await keyword would definitely introduce a break point for other stuff to run. Turns out that's not the case, which is actually pretty cool.
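
Driving it by hand (no event loop involved) shows it:

gen = a().__await__()
try:
    next(gen)
except StopIteration as exc:
    print(exc.value)  # 'Hello World!' -- it finished without ever suspending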

I think that for gather we're probably going to need to just declare that it must be asyncio compatible. I believe that it was a design goal that asyncio be compatible with other approaches as much as possible, so that an asyncio.Future can be used by other libraries. I haven't confirmed this to be the case yet, but I think that's likely. We could parameterize gather, which is the only alternative I see at the moment to avoid the asyncio dependency, but I'm not sure that's the right choice.

If we choose to parameterize and it turns out we didn't need to, we'll end up breaking backward compatibility in order to default to the parallel option, since we'd likely start out defaulting to a simple non-parallel version to avoid the asyncio import. If we don't parameterize but it turns out we need to, then the library will end up being needlessly coupled to asyncio, even if only through the default parameters, and we'll probably end up breaking backward compatibility that way too.

Either way, I guess we need to hold off making either choice until we understand the trade-offs.