calpaterson / python-web-perf

Code for testing performance of popular python webservers

Found a couple of bugs #11

Open miguelgrinberg opened 4 years ago

miguelgrinberg commented 4 years ago

Hi, thanks for doing this test.

Wanted to mention a few important bugs that I've found in your tests.

Would you be willing to re-run your tests after you fix these problems? I believe they are severe enough to have the potential to change your results in a big way.

calpaterson commented 4 years ago

Hi, thanks for taking the time to reproduce everything. I was surprised that so few people did that, so you are one of the few - thanks for being interested in the truth of it. Some thoughts:

Yes, I did use the normal loop with aiohttp - when I read their docs this was what was advised and it was what I was using at work at the time. Agree that uvloop seems to help in all cases but I didn't know that before I started!
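
For reference, here is a minimal sketch of what opting into uvloop with aiohttp typically looks like; the handler and route are made up for illustration, not the benchmark's actual app:

import asyncio

import uvloop
from aiohttp import web

# Install uvloop's event loop policy before the app starts; without this,
# aiohttp runs on the default asyncio loop (as the test in this repo does).
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

async def hello(request):
    # Hypothetical handler, standing in for whatever the benchmark endpoint does
    return web.json_response({"ok": True})

app = web.Application()
app.add_routes([web.get("/", hello)])

if __name__ == "__main__":
    web.run_app(app)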

re: psycopg2 and meinheld/gevent. That is interesting and I would like to investigate further. To some extent that is not so consistent with what I saw in monitoring when I did the tests, which makes me curious!

I also used an external pooler (pgbouncer) which was necessary in practice for some of the asyncio setups as I (conversely) found that aiopg's internal pooler did not work well - I seem to recall it didn't return connections to the pool properly but as this was some months ago I have now forgotten the details. pgbouncer may have disguised a number of bugs for me.

My reflection is that if I did this again I wouldn't use in-process pooling, as it's firstly redundant and secondly each pool only has local knowledge of its own process, so it seems a bit limited. I have to say I'm not really clear why I did it in the first place as I don't do that for production apps - I just use pgbouncer.
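
As an illustration, here is a minimal sketch of skipping in-process pooling entirely and letting pgbouncer do the work; the DSN (assuming pgbouncer on localhost:6432) and the things table are hypothetical:

import psycopg2

# Hypothetical DSN pointing at pgbouncer rather than Postgres directly
DSN = "host=127.0.0.1 port=6432 dbname=app user=app"

def fetch_row(pk):
    # Open a plain connection per request; pgbouncer reuses the server-side
    # connections, so there is no in-process pool to exhaust or misbehave.
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM things WHERE id = %s", (pk,))
            return cur.fetchone()
    finally:
        conn.close()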

As a philosophical note I think there is something to be said for raising errors when the connection pool is exhausted rather than just waiting. To some extent this depends on your expectations but I personally feel that an exhausted db connection pool is a suspicious condition in a real world application. YMMV.

I'm sorry to say that I don't have time to redo these tests at the moment. It may be something I revisit further in the future (particularly once Django's async settles down).

Thanks again! Cal

miguelgrinberg commented 4 years ago

re: psycopg2 and meinheld/gevent. That is interesting and I would like to investigate further. To some extent that is not so consistent with what I saw in monitoring when I did the tests, which makes me curious!

Well, I'm positive my assessment is correct, so if you did not see a single database connection per worker process, then you must have been using code that is different from what you have in this repository.

The big question is why the meinheld tests did so well in spite of this. And I think I know the answer. The database query that you are using is so simple and quick that there is little/no gain in parallelizing it.

miguelgrinberg commented 4 years ago

@calpaterson I just noticed that you quoted my tweet with results that more or less agree with yours. That was before I found the issues stated above. My results have actually shifted a bit since I fixed these problems, and I cannot say that they agree with yours anymore. I will publish a blog post with my investigation and results shortly.

calpaterson commented 4 years ago

Ok! I look forward to your blogpost!

calpaterson commented 4 years ago

Ah, I've just seen that you've already posted it. Thanks! Will have a look through when I get some time and perhaps continue the discussion here.

calpaterson commented 4 years ago

Hi Miguel. I'm sorry that I've only just gotten around to reading your article. I'm afraid I don't like it.

One of the key bits of what I wrote was:

Increased throughput is really only useful while latency is within an acceptable range. [emph original]

I returned to this a few times but the crux of it is that high latency variance makes your server chronically wonky - some requests are completed in an acceptable time but others are not. Users typically make a number of requests so they see the 99th percentile response with some frequency - problems manifest here as random timeouts. I talk about the "sysadmin lore" side of this too, for example with Etsy.
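
A back-of-the-envelope illustration of that point (assuming requests are independent, which real sessions aren't exactly):

# Chance that a user who makes n requests sees at least one response
# slower than the 99th percentile.
def p_sees_p99(n: int, p: float = 0.99) -> float:
    return 1 - p ** n

for n in (1, 10, 30, 100):
    print(n, round(p_sees_p99(n), 3))
# 30 requests -> roughly a 26% chance of hitting at least one p99-slow response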

The long and short of it is that if your server is showing enormous latency variances under a homogeneous workload it's going to get much, much worse under the normal variance in request times present in a proper webapp. Probably said webapp will just break down, pretty ungracefully, under load.

You've passed over this and just written another throughput benchmark - comparing multiples of throughput and ignoring the fact that you somewhat suspiciously have P99's that evidence enormous latency variance for the async servers and generally not for the sync servers. The "hiccups" explanation you provide isn't convincing. The reason is almost certainly (lack of) fairness in scheduling.

Beyond that, I rate some of your changes to the benchmark (removing double connection pooling), have doubts about some (just adding a constant delay to the SQL queries doesn't really increase fidelity IMO) and think others are decidedly wrong (massively increasing the number of available connections).

miguelgrinberg commented 4 years ago

I'm afraid I don't like it.

I expected as much.

you somewhat suspiciously have P99's that evidence enormous latency variance for the async servers and generally not for the sync servers

I mentioned in the article that the short P99 times occurred randomly on a handful of tests in each run and they were not specifically associated with sync tests, so this line of reasoning is invalid.

just adding a constant delay to the SQL queries doesn't really increase fidelity IMO

massively increasing the number of available connections

These two would make the test more I/O bound, which is when async solutions perform better. I'm not surprised that you are against them, but I'm not sure why you think I'm in favor. I just want people to be aware that the choices that you make can be used to make the test say whatever you want, which is exactly what you've done by choosing an ultra-simple database query and a small number of database sessions (though not small enough to start hurting the sync tests as well).

calpaterson commented 4 years ago

I just want people to be aware that the choices that you make can be used to make the test say whatever you want

I think the problem for you is that your own results follow mine on latency variance, so you have not been successful in saying "whatever you want" but have in fact completed a replication!

You aren't the only one - several people have written to me saying they've done so. Some of them have used the code in this repo and others have used different setups.

miguelgrinberg commented 4 years ago

I think the problem for you is that your own results follow mine on latency variance

They don't. You are ignoring that the tests that gave me P99 times in the 1100ms range were both sync and async, and different in every run. It is caused by an external condition, not something that can be attributed to specific tests.

And when I ran the tests configured with choices similar to yours, my throughput results got closer to yours, but I still got those high latencies.

blazespinnaker commented 3 years ago

The fundamental change in @miguelgrinberg's benchmark seemed to be this:

"The database query issued in the original benchmark was a simple search by primary key. To make the test more realistic I made the query slightly slower by adding a short delay to it. I understand that this is an extremely subjective area and many will disagree with this change, but what I observed is that with such quick queries there wasn't much opportunity for concurrency"

I think that's a reasonable change for those who have applications that can't cache/index properly (they exist), but in my experience it is not the most common case. Caching and proper indexing are perhaps the most important things you can do when creating high-performance web applications.

Also, maybe I'm missing something, but "there wasn't much opportunity for concurrency" ... how could that be accurate, as threads will be blocking on I/O and will require significant concurrency / context switching? Are the sync frameworks executing cooperative concurrency under the hood? If so, then I guess the results make sense, but they're really just more mature/optimized async IMHO.

Perhaps I misunderstood something else, but the p99 latencies for sync frameworks that @miguelgrinberg reported were far superior to the async ones.

@calpaterson says "while throughput can be improved by adding machines, latency under load doesn't get better when you do that." I don't think it can be said any better than that (at least in my experience with high-volume, highly performant web applications). Quality matters.

That all said, cooperative concurrency is certainly more efficient, faster and more reliable than preemptive - theoretically. Nginx, for example, has proven that fairly extensively over the years. I can't imagine there is much doubt around that. It would be very interesting to figure out what's causing the lack of fair execution here.

miguelgrinberg commented 3 years ago

The fundamental change in @miguelgrinberg's benchmark seemed to be this:

I have also fixed bugs in the benchmark as shown in this repository. Concurrency for the greenlet tests is disabled here, while my tests made it work. Also, the aiohttp test in this repo is using the default asyncio loop, but his article indicates that all tests used uvloop, which is misleading. I really did use uvloop for all asyncio tests.

Caching and proper indexing are perhaps the most important things you can do when creating high-performance web applications.

Proper indexing does not help when you need to perform joins across several tables (or maybe I should say, it does help, but joins are still expensive operations). This test uses a single table and accesses rows by their primary key. That is completely unrealistic. I've added a configuration variable that allows you to see how the benchmark results change as queries get longer. I'm not saying all applications have long database queries, I'm just saying that some do, and for those, the test as implemented in this repository is not a good representation. If you read my article you'll find that I've run the benchmark with different database query durations to see how that affects the results. That's the whole point of my article.
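
As a rough sketch of what that kind of knob can look like (this is a guess at the mechanism, not necessarily the configuration variable the article actually uses; the QUERY_SLEEP name and things table are made up):

import os

# Extra per-query delay in seconds, settable from the environment
QUERY_SLEEP = float(os.environ.get("QUERY_SLEEP", "0"))

# pg_sleep makes the otherwise trivial primary-key lookup take a configurable time
SQL = "SELECT *, pg_sleep(%s) FROM things WHERE id = %s"

def run_query(cur, pk):
    cur.execute(SQL, (QUERY_SLEEP, pk))
    return cur.fetchone()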

Also, maybe I'm missing something, but "there wasn't much opportunity for concurrency" ... how could that be accurate, as threads will be blocking on I/O

There are no threads in this benchmark. The sync tests run one thread per process; the only concurrency is across processes and is provided by the OS. The async tests also run one thread per process, and the OS manages inter-process concurrency (though there isn't much to do here because I'm running one process per CPU). The comment that I've made is about the cooperative concurrency in the greenlet and asyncio tests, which is managed by the async event loop. My point is that with these database queries being so quick, there isn't a great opportunity to parallelize async tasks against each other. When I ran my benchmark with longer db queries, the async tests improved their performance relative to the sync tests.
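
For readers following along, here is a gunicorn-style config sketch of the process model being described: one single-threaded worker per CPU. The worker_class values are only examples, not necessarily what either benchmark used:

# gunicorn.conf.py (sketch)
import multiprocessing

workers = multiprocessing.cpu_count()  # one worker process per CPU
threads = 1                            # a single thread per worker
worker_class = "sync"                  # e.g. "meinheld.gmeinheld.MeinheldWorker" for the
                                       # meinheld runs, or an aiohttp/uvicorn worker for async
bind = "0.0.0.0:8000"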

Perhaps I misunderstood something else, but the p99 latencies for sync frameworks that @miguelgrinberg reported were far superior to the async ones.

I mention this in the article, and also above. The variation in P99 numbers was not associated with async tests, slow P99s were distributed randomly across sync and async tests. I just happened to capture one of the runs that could suggest that only async tests were getting slow P99 grades (although if you look carefully you'll see that some of the sync tests got them too!).

In any case, to help you guys not be stuck on that, I have added a second run of my benchmark to my article. Hopefully this will help you realize that the P99 issue is not associated with particular tests. If/when I figure out what it is, I would be happy to update my benchmark once again, but given that it is not associated with any particular tests, I don't find it a top priority to debug it. FWIW, the P98 numbers in my benchmark did not show this large variation.

That all said, cooperative concurrency is certainly more efficient, faster and more reliable than preemptive - theoretically. Nginx, for example, has proven that fairly extensively over the years.

Cooperative concurrency works better for I/O bound processes. It does not give any improvement for CPU bound processes. Nginx is heavily I/O bound, as all it does is read and write sockets and files.

blazespinnaker commented 3 years ago

"There are no threads in this benchmark"

Right, thanks: "processes", which is the source of @calpaterson's contention that the OS does a better job of managing fair execution than fresh asyncio Python code.

However, that is still going to involve significant context switching and concurrency (from the OS) as each sync process potentially blocks on I/O.

I remain skeptical about the db change and its validity. Where I work we specifically denormalize anything that might get commonly called to keep our p99 down. Low latency matters, even at the cost of throughput, as a great user experience translates to more transactions and higher revenue / profit. Google "why latency matters" to learn more.

We have slow REST calls for sure, but they comprise about 5-10% of our traffic at best.

It'd be good to share your p98s for both tests if you have them. It would be very reassuring and informative if they perform consistently that much better in async.

blazespinnaker commented 3 years ago

If you're running on Linux, tc is useful for testing blocking I/O: https://unix.stackexchange.com/questions/28198/how-to-limit-network-bandwidth

A slow db isn't realistic to me, but slow user connections definitely are.

If I get some time, I'll run a couple of the tests. Any good recommendations on the two frameworks I should run? (sync versus async)

miguelgrinberg commented 3 years ago

@blazespinnaker I'm not really sure exactly what you are after, but in any case, your perseverance made me go back to this and investigate those P99s a bit more. It turns out that running the benchmark on a different computer did not show those 1100ms-or-so P99 times. I have now updated my article to show the new results that I've got (which, other than much better P99 times, continue to show the same trends).

That said, I feel you are missing the point of my article. Neither my test nor Cal's are going to give you what you need. Picking a framework based on some benchmark you found on the Internet is not something I would recommend. My advice would be that you choose the framework that makes you most productive. That matters way more than millisecond differences in P99s!

blazespinnaker commented 3 years ago

Actually, I'm not picking a framework, and I'm not sure how you arrived at that assumption.

And honestly, even if I were, I really wouldn't use Python for a backend. If I were working on a new project, it would likely be Node.js or Java, depending on the skillsets and the level of security / enterprise integration required. Python really has a long way to go before it can match the performance of Java, and it doesn't have the benefit of full-stack JavaScript.

However, there are other problem sets which benefit greatly from Python and which I'm interested in. These problem sets have high levels of I/O and concurrency, and I'm trying to decide between using the multiprocessing module and async.
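
To make that comparison concrete, here is a toy sketch of the two shapes being weighed; the sleeps stand in for I/O and the numbers are arbitrary:

import asyncio
import time
from multiprocessing import Pool

def fetch_one(i):
    # Stand-in for a blocking network call; processes run these preemptively
    time.sleep(0.05)
    return i

def with_processes(n=100):
    with Pool(processes=8) as pool:
        return pool.map(fetch_one, range(n))

async def fetch_one_async(i):
    # Stand-in for a non-blocking network call; tasks interleave on one event loop
    await asyncio.sleep(0.05)
    return i

async def with_asyncio(n=100):
    return await asyncio.gather(*(fetch_one_async(i) for i in range(n)))

if __name__ == "__main__":
    print(len(with_processes()), len(asyncio.run(with_asyncio())))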

So I am very interested in any performance concerns around async and Python. I am trying to understand your statements about concurrency, as they are not aligning with my understanding: there should be a great deal of context switching in the sync scenario because of blocking on I/O, and similarly for the async code, which should be non-blocking on I/O.

I am interested in reproducing results and was hoping for some guidance on which tests to narrow down on. I would also like to investigate at a deeper level some of the tuning decisions made, which will be easier to do if I just zero in on one async and one sync framework to repro.

It's possible that, because of how the tests were set up and run, neither scenario is realistic: the I/O all arrives fully buffered / ready to execute and there is minimal blocking on load-generator I/O, which is not that realistic in my experience. There are also some questions around connection keepalive that I'd like to verify while I'm doing this. These issues are not just edge cases, but could have a profound impact depending on how the underlying implementations handle I/O.

I would also like to run the tests with @calpaterson's approach of minimal backend latency but with the tuning improvements that were made.

kigawas commented 3 years ago

@calpaterson You make me feel that sometimes it's harder for someone to admit they are wrong than be killed 😃

dnp1 commented 2 years ago

@blazespinnaker I could not understand why you disagree with the "database query sleep". I totally agree with @miguelgrinberg: no matter how optimized they are, database queries can take longer than 1 ms. Besides that, realistic applications have different endpoints, and understanding how each endpoint affects the others is interesting, so in my tests I use a function to define the "sleep" time:

def get_sleep(i: int) -> float:
    # Map the i-th request to a sleep time in seconds: mostly fast (10 ms),
    # with progressively rarer 50 ms, 200 ms and 800 ms "slow" endpoints.
    if i % 100 == 0:
        return .8
    if i % 20 == 0:
        return .2
    if i % 10 == 0:
        return .05
    return .01
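
(A guess at how a per-request delay like that might be applied, not necessarily dnp1's actual code:)

import asyncio

async def simulated_query(i: int) -> None:
    # The i-th request waits get_sleep(i) seconds, standing in for a query of
    # that duration (in a real test this could be a pg_sleep in the SQL instead)
    await asyncio.sleep(get_sleep(i))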

If waiting on I/O is not a core part of your application, it does not make sense to use asyncio at all (or Node).

The DB here is not the subject of the analysis; it is just an example of a network call with a limited amount of concurrency (the number of connections). An API can call other APIs (AWS services, internal APIs, external APIs, etc.).

I really missed a docker-compose environment, plus measurements of memory and CPU usage.

In the real world we'll usually run on Kubernetes, and cost is a real constraint.