pnico opened 11 months ago
Are some operations that were performed asynchronously before, in App Engine Python 2, now being performed synchronously, or with a different concurrency approach that might especially disadvantage instances with a low number of cores?
It looks like this does decrease the number of warnings. Is it safe, or helpful?
```python
# in main.py
from google.appengine.runtime import default_api_stub

default_api_stub.URLLIB3_POOL_SIZE = 50
```
Also, increasing the number of threads seems to help reduce the `RPCFailed` errors:

```python
default_api_stub.MAX_CONCURRENT_API_CALLS = 300
```
This seems sketchy to do. Is there any way around it? Could this be done without threads, and was it using threads before?
I know this might not be helpful, but we saw the same log spam of `Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10`. One of the ways to trigger the spam is to use ndb `IN` queries, which internally fire multiple requests to Datastore concurrently and then join the results. Here's a minimal service to reproduce the bug:
""" Repro service for https://github.com/GoogleCloudPlatform/appengine-python-standard/issues/105
Exposes a single endpoint @ / that fires an IN query
Modify the number of items through URL parameter count, like so:
- {SERVICE_URL}/?count=20
Any value over 10 will cause log spam of:
"Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10"
"""
import google.appengine.api
import google.cloud.logging
from flask import Flask, request
from google.appengine.ext import ndb
app = Flask(__name__)
class TestModel(ndb.Model):
property = ndb.IntegerProperty()
@app.route('/', methods=['GET'])
def test():
count = request.args.get('count', default=11, type=int)
values = list(range(count))
_ = TestModel.query(
TestModel.property.IN(values)
).order(TestModel.key).fetch_page(1)
return ''
client = google.cloud.logging.Client()
client.setup_logging()
app = google.appengine.api.wrap_wsgi_app(app)
What we ended up doing was to temporarily filter them out with a logging filter while we wait for a solution...
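A minimal sketch of that kind of filter, assuming the warnings come from the `urllib3.connectionpool` logger (the class name and attach point here are illustrative, not from this library):

```python
import logging


class ConnectionPoolFilter(logging.Filter):
    """Drop urllib3's 'Connection pool is full' warnings, keep everything else."""

    def filter(self, record):
        # Returning False suppresses the record.
        return "Connection pool is full" not in record.getMessage()


# urllib3 emits these warnings through the 'urllib3.connectionpool' logger.
logging.getLogger("urllib3.connectionpool").addFilter(ConnectionPoolFilter())
```

Attaching the filter to that one logger leaves other warnings (and real errors) untouched.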
@spiqueras it makes sense. There are some places where we can economize on the number of RPC calls. For example, we use `max_entity_groups_per_rpc` because it's supposed to be a performance enhancement (maybe it was, at the time), but it's possible that's irrelevant now that RPC calls are apparently processed very differently. Tweaking threads / concurrent connections does reduce it somewhat too, for what that's worth. I'm not overly concerned with the log spam (from "Connection pool is full"), though, as it doesn't seem to affect functionality. (Edit: the RPCFailed errors are more worrying, since they didn't happen before; now we have to retry lots of operations that almost always just worked before, which slows down our application, increases costs, etc.)
The increased latency and cost to run our application is a concern, to put it mildly - it's fairly dramatic. Are you seeing any of this?
Yes, we've seen similar RPCFailed errors, particularly when writing entities in batch to Datastore (more so when those entities are on the bigger side), although they're not frequent enough to pose a problem.
Increased latency/cost is something we also had to deal with. Tweaking the default gunicorn configuration and reviewing all dependencies have been key to keeping the costs down.
> Tweaking the default gunicorn and reviewing all dependencies have been key to keeping the costs down.
With F1 instances this is the best we've come up with so far, but it's still a lot worse than before:

```shell
gunicorn -b :$PORT main:webapp --max-requests 5000 -w 1 --timeout=60
```
Any other tweaks you can recommend? Re. dependencies, is this mainly to conserve memory and startup/import time? We also tried tweaking autoscaling settings, but didn't see much improvement there compared to what we had before.
> Re. dependencies, is this mainly to conserve memory and startup/import time?
Indeed.
> Any other tweaks you can recommend?
I'm afraid not, as we don't use F1. For beefier instance types, we've seen fewer instance crashes by adjusting the number of workers down, the number of threads per worker up, and using `--preload`.
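A sketch of that shape of invocation (the worker and thread counts here are purely illustrative, not a recommendation; tune them for your instance class):

```shell
# Fewer workers, more threads per worker, and the app preloaded
# in the master process before forking:
gunicorn -b :$PORT main:app -w 2 --threads 8 --preload --timeout 60
```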
Ah, good to know; we might experiment with instance size to see if it gives better value overall.
Just an update on this, though: previously our `protobuf` version was pinned to `<4` because of some issues in unit tests (also, this was with an earlier release of `appengine-python-standard`), so we thought it was necessary in production too. Changing this to 4.25.2 improved latency by a lot, a bit closer to what we had before. From looking at other issues in this repo, I guess this can cause some issues with projection queries, but maybe using `<4` causes it to use the pure-Python implementation? In any case, it seems to be working OK so far and instance counts are lower, fingers crossed. There should maybe be a note in the README about this.
Hi @pnico, I'm having the same issue. Can you please share more detailed info on how to resolve it? Is it actually working?
Thanks
@toan-le-theout I'm not working on the project that uses App Engine any more, but I can say that for that project the biggest meaningful difference was fixing the `protobuf` version and setting it to 4.25.2, which helped with latency a bit and reduced the number of instances being created. If you're already using a protobuf version >= 4, this doesn't help you, but you can try the gunicorn settings for number of workers and max requests, along with App Engine max-requests and load-balancing settings, to optimize.
For the logs of "Connection pool is full", I did see a reduction by playing with `default_api_stub.URLLIB3_POOL_SIZE` and `default_api_stub.MAX_CONCURRENT_API_CALLS`, though I hesitate to recommend this, since it just seems like a sketchy thing to do. Also, I suspect this log should really be considered a minor warning: I don't think it actually prevents new threads from being spawned when they're needed, so my guess is that tweaking these just reduces the annoying log spam, which you could instead ignore or filter out in your queries looking for errors. I didn't notice any actual performance improvement from messing with these settings; we ended up just calling our "solution" good enough.
For the `RPCFailed` errors, we already had a custom decorator to wrap API calls and retry them using a "full jitter" approach like the one detailed here. Something like this is good to have in general anyway, because you can catch and retry datastore/transaction failures, which happen from time to time in any case. So we just added this error to our existing list of error types that trigger a retry with a delay using `ndb.sleep()`. There are some Python libraries for backoff + retry you could look into if you don't want to roll your own, but after looking at them I decided on a simpler custom function, since they seemed like overkill for the features we needed.
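A generic sketch of such a decorator, using stdlib `time.sleep` and `random` (the name `retry_full_jitter` and its parameters are made up for illustration; in App Engine code you would sleep with `ndb.sleep()` as described above, and pass the actual exception types you're seeing):

```python
import functools
import random
import time


def retry_full_jitter(exceptions, tries=5, base=0.1, cap=5.0):
    """Retry the wrapped call on the given exception types, sleeping a
    random ("full jitter") delay between 0 and min(cap, base * 2**attempt)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise  # out of retries, re-raise the last error
                    time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        return wrapper
    return decorator
```

Usage would look like `@retry_full_jitter((apiproxy_errors.RPCFailedError,))` around a datastore call, assuming that's the exception type in your logs.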
Something I was curious about at the time was the GIL-less Python in development allowing "true" multithreading. I don't know if this library will ever be updated to support that, but if it happens, it could be something to try in the future, although with small instances / few CPU cores it's hard to know whether it would be better or worse.
I noticed today that the other App Engine APIs recently received an update allowing `protobuf` to be upgraded to v5, and also that `urllib3` has a v2 that includes a lot of changes around retry behavior. (See #121.) I have no idea whether upgrading either of these dependencies would help, but if this library ever follows suit, I'll be curious to try them and see if we get fewer mysterious RPCFailed errors.
We are using F1 instances (2 cores). In Python 3, with no meaningful code changes, we are seeing 1.5-2x median latency across the app, an even greater increase in created/active/billed instances with the same appengine settings, and some new errors and warnings we never saw before. For example, we see this a lot now with datastore operations:
We also see tons of these warnings, not related to any outgoing web calls our app is making (I know it's just a warning and might be harmless, but the volume in our logs is extremely high):
I've been tweaking the app.yaml in various ways, but I can't find a configuration that solves or even significantly reduces either issue. Our app is doing the exact same operations as before, with the same load. What can we try? This isn't sustainable: we can't afford this jump in cost, and the app is performing much worse now. Our unit tests run in about 30% less time compared to Python 2, so I didn't expect this; it seems like it could be I/O related?
Not sure if this is relevant, but I noticed this in google.appengine.api.apiproxy_rpc.py:
But then in concurrent.futures.threads.py we see this comment in the constructor:
So the default number of workers would be 6, but apiproxy_rpc.py sets it to 100 regardless of the actual number of CPU cores available. And that default has a hardcoded limit of 32, because unbounded worker counts were reported as "consuming surprisingly large resources." What led to the decision to use 100 here? (Edit: is this maybe OK because RPC calls aren't CPU-bound? Should the urllib3 connection pool size be increased to match?)
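For reference, the stdlib default being discussed is computed like this in CPython 3.8+ (`default_workers` is just an illustrative name):

```python
import os

# ThreadPoolExecutor's default when max_workers is not given (Python 3.8+):
# capped at 32 because unbounded thread counts were reported to consume
# surprisingly large resources on many-core machines.
default_workers = min(32, (os.cpu_count() or 1) + 4)

# On a 2-core instance this works out to min(32, 2 + 4) = 6 workers,
# versus the 100 that apiproxy_rpc.py uses regardless of core count.
```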