GoogleCloudPlatform / appengine-python-standard

Google App Engine services SDK for Python 3
Apache License 2.0

Elevated latency, instance count and new errors/warnings compared to Python 2 runtime #105

Open pnico opened 11 months ago

pnico commented 11 months ago

We are using F1 instances (2 cores). In Python 3, with no meaningful code changes, we are seeing 1.5-2x the median latency across the app, an even greater increase in created/active/billed instances with the same App Engine settings, and some new errors and warnings we never saw before. For example, we now see this a lot with datastore operations:

...
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/datastore/datastore_query.py", line 2949, in __query_result_hook
    self._batch_shared.conn.check_rpc_success(rpc)
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/datastore/datastore_rpc.py", line 1365, in check_rpc_success
    rpc.check_success()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/api/apiproxy_stub_map.py", line 614, in check_success
    self.__rpc.CheckSuccess()
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/api/apiproxy_rpc.py", line 149, in CheckSuccess
    raise self.exception
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/runtime/default_api_stub.py", line 276, in _CaptureTrace
    f(**kwargs)
  File "/layers/google.python.pip/pip/lib/python3.11/site-packages/google/appengine/runtime/default_api_stub.py", line 261, in _SendRequest
    raise self._ErrorException(*_DEFAULT_EXCEPTION)
google.appengine.runtime.apiproxy_errors.RPCFailedError: The remote RPC to the application server failed for call datastore_v3.RunQuery().

^^ (similar traces also occur with datastore_v3.Get())

We also see tons of the following warning, not related to any outgoing web calls our app is making. I know it's just a warning and might be harmless, but the extremely high volume in our logs is worrying:

Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10

I've been tweaking app.yaml in various ways, but I can't find a configuration that solves or even significantly reduces either issue. Our app is doing the exact same operations as before, under the same load. What can we try? This isn't sustainable: we can't afford this jump in cost, and the app is performing much worse now. Our unit tests actually run in about 30% less time than they did on Python 2, so I didn't expect this; it seems like it could be I/O related?

Not sure if this is relevant, but I noticed this in google/appengine/api/apiproxy_rpc.py:

_MAX_CONCURRENT_API_CALLS = 100

_THREAD_POOL = futures.ThreadPoolExecutor(_MAX_CONCURRENT_API_CALLS)

But then in the standard library's concurrent/futures/thread.py we see this comment in the constructor:

class ThreadPoolExecutor(_base.Executor):

    # Used to assign unique thread names when thread_name_prefix is not supplied.
    _counter = itertools.count().__next__

    def __init__(self, max_workers=None, thread_name_prefix='',
                 initializer=None, initargs=()):
        """Initializes a new ThreadPoolExecutor instance.

        Args:
            max_workers: The maximum number of threads that can be used to
                execute the given calls.
            thread_name_prefix: An optional name prefix to give our threads.
            initializer: A callable used to initialize worker threads.
            initargs: A tuple of arguments to pass to the initializer.
        """
        if max_workers is None:
            # ThreadPoolExecutor is often used to:
            # * CPU bound task which releases GIL
            # * I/O bound task (which releases GIL, of course)
            #
            # We use cpu_count + 4 for both types of tasks.
            # But we limit it to 32 to avoid consuming surprisingly large resource
            # on many core machine.
            max_workers = min(32, (os.cpu_count() or 1) + 4)

So the default number of workers here would be 6 (cpu_count + 4), but apiproxy_rpc.py sets it to 100 regardless of the actual number of CPU cores available. And the stdlib default is capped at 32 precisely to avoid "consuming surprisingly large resource" on many-core machines. What led to the decision to use 100 here? (Edit: is this maybe OK because RPC calls aren't CPU-bound? Should the urllib3 connection pool size be increased to match?)

pnico commented 11 months ago

Are some operations that were performed asynchronously in the Python 2 runtime now being performed synchronously, or with a different concurrency approach that might especially disadvantage instances with a low number of cores?

pnico commented 11 months ago

It looks like this does decrease the number of warnings. Is it safe, or helpful?

# in main.py
from google.appengine.runtime import default_api_stub
default_api_stub.URLLIB3_POOL_SIZE = 50

pnico commented 10 months ago

Also, increasing the number of threads seems to help reduce the RPCFailedErrors:

default_api_stub.MAX_CONCURRENT_API_CALLS = 300

This seems sketchy to do... Is there any way around it? Could this be done without threads, or was it using threads before as well?
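
For reference, here's roughly what we have in main.py now with both overrides in one place (a sketch; I'm assuming these module-level constants are picked up as long as they're set before the first API call, which I haven't verified):

# main.py (sketch) - both workarounds together; ordering relative to the first
# API call is an assumption on my part.
from google.appengine.runtime import default_api_stub

default_api_stub.MAX_CONCURRENT_API_CALLS = 300  # allow more simultaneous API RPCs
default_api_stub.URLLIB3_POOL_SIZE = 50          # larger urllib3 pool to roughly match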

spiqueras commented 10 months ago

I know this might not be helpful, but we saw the same log spam of "Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10". One of the ways to trigger the spam is to use ndb IN queries, which internally fire multiple requests to datastore concurrently and then join the results. Here's a minimal service to reproduce the bug:

""" Repro service for https://github.com/GoogleCloudPlatform/appengine-python-standard/issues/105

Exposes a single endpoint @ / that fires an IN query
Modify the number of items through URL parameter count, like so:
- {SERVICE_URL}/?count=20

Any value over 10 will cause log spam of:
"Connection pool is full, discarding connection: appengine.googleapis.internal. Connection pool size: 10"
"""
import google.appengine.api
import google.cloud.logging
from flask import Flask, request
from google.appengine.ext import ndb

app = Flask(__name__)

class TestModel(ndb.Model):
    property = ndb.IntegerProperty()

@app.route('/', methods=['GET'])
def test():
    count = request.args.get('count', default=11, type=int)
    values = list(range(count))
    _ = TestModel.query(
        TestModel.property.IN(values)
    ).order(TestModel.key).fetch_page(1)
    return ''

client = google.cloud.logging.Client()
client.setup_logging()

app = google.appengine.api.wrap_wsgi_app(app)

What we ended up doing was to temporarily filter them out with a logging filter while we wait for a solution...
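
Roughly like this (a minimal sketch, not our exact filter - the warning is emitted by urllib3's connectionpool logger, so we attach the filter there):

import logging

class DropPoolFullWarnings(logging.Filter):
    """Drop the noisy 'Connection pool is full' warnings emitted by urllib3."""

    def filter(self, record):
        # Returning False drops the record; everything else passes through.
        return 'Connection pool is full' not in record.getMessage()

logging.getLogger('urllib3.connectionpool').addFilter(DropPoolFullWarnings())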

pnico commented 10 months ago

@spiqueras that makes sense. There are some places where we can economize on the number of RPC calls - for example, we set max_entity_groups_per_rpc because it was supposed to be a performance enhancement (maybe it was, at the time), but that may be irrelevant now that RPC calls are apparently processed very differently anyway. Tweaking threads / concurrent connections does reduce the spam somewhat too, for what that's worth. I'm not overly concerned with the "Connection pool is full" log spam, though, as it doesn't seem to affect functionality. (Edit: the RPCFailed errors are more worrying since they didn't happen before; now we have to retry lots of operations that previously just worked almost all the time, which slows down the application, increases costs, etc.)

The increased latency and cost of running our application are a concern, to put it mildly - the jump is fairly dramatic. Are you seeing any of this?

spiqueras commented 9 months ago

Yes, we've seen similar RPCFailed errors, particularly when writing entities to datastore in batches (more so when those entities are on the bigger side), although they're not frequent enough to pose a problem.

Increased latency/cost is something we also had to deal with. Tweaking the default gunicorn configuration and reviewing all dependencies have been key to keeping costs down.

pnico commented 8 months ago

Tweaking the default gunicorn configuration and reviewing all dependencies have been key to keeping costs down.

With F1 instances this is the best we've come up with so far, but it's still a lot worse than before:

gunicorn -b :$PORT main:webapp --max-requests 5000 -w 1 --timeout=60

Any other tweaks you can recommend? Re. dependencies, is this mainly to conserve memory and startup/import time? We also tried tweaking autoscaling settings, but didn't see much improvement there compared to what we had before.

spiqueras commented 8 months ago

Re. dependencies, is this mainly to conserve memory and startup/import time?

Indeed.

Any other tweaks you can recommend?

I'm afraid not, as we don't use F1. For beefier instance types, we've seen fewer instance crashes by adjusting the number of workers down, the number of threads per worker up, and by using --preload.

pnico commented 7 months ago

Ah, good to know - we might experiment with instance size to see if it gives better value overall...

Just an update on this, though: previously we had pinned protobuf to <4 because of some issues in unit tests (this was also with an earlier release of appengine-python-standard), so we thought the pin was necessary in production too. Changing it to 4.25.2 improved latency by a lot, bringing it a bit closer to what we had before. From looking at other issues in this repo, I gather protobuf >=4 can cause some issues with projection queries, but maybe pinning to <4 forces the pure-Python implementation? In any case, it seems to be working OK so far and instance counts are lower, fingers crossed. There should maybe be a note in the README about this.
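
If it helps anyone check, I believe you can see which protobuf backend is active with something like this (a sketch - api_implementation is an internal module, not a public API):

from google.protobuf.internal import api_implementation

# Prints 'upb', 'cpp' or 'python' depending on which protobuf backend is in use.
print(api_implementation.Type())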

toan-lea1 commented 1 month ago

Hi @pnico, I'm having the same issue. Can you please share more detail on how to resolve it? Is it actually working?

Thanks

pnico commented 1 month ago

@toan-le-theout I'm not working on the project that uses App Engine any more, but I can say that for that project the biggest meaningful difference was pinning the protobuf version to 4.25.2, which helped with latency a bit and reduced the number of instances being created. If you're already on a protobuf version >= 4 this won't help you, but you can try tuning the gunicorn settings for number of workers and max requests, along with the App Engine max-requests and load-balancing settings, to optimize.

For the "Connection pool is full" logs, I did see a reduction by playing with default_api_stub.URLLIB3_POOL_SIZE and default_api_stub.MAX_CONCURRENT_API_CALLS - though I hesitate to recommend this, since it just seems like a sketchy thing to do. Also, I suspect this log should really be considered just a minor warning: I don't think a full pool actually prevents new connections from being made when they're needed, so my guess is that raising the pool size just reduces the annoying log spam, which you could also simply ignore or filter out of the queries you use when looking for errors. I didn't notice any actual performance improvement from messing with these settings; we ended up just calling our "solution" good enough.

For the RPCFailed errors, we already had a custom decorator to wrap API calls and retry them using a "full-jitter" backoff approach like the one detailed here. Something like this is good to have in general anyway, because you can catch and retry datastore/transaction failures, which happen from time to time in any case. So we just added this error to our existing list of error types that trigger a retry with a delay (using ndb.sleep()); there are some Python libraries for backoff + retry you could look into if you don't want to roll your own, but after looking at them I went with a simpler custom function since they seemed like overkill for the features we needed.
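
For illustration, a stripped-down sketch of the kind of decorator I mean (not our exact code - the attempt counts and delays are made-up numbers, and in tasklet contexts we use ndb.sleep() instead of time.sleep()):

import functools
import random
import time

from google.appengine.runtime import apiproxy_errors

MAX_ATTEMPTS = 5    # illustrative values only; tune for your workload
BASE_DELAY = 0.1    # seconds
MAX_DELAY = 2.0     # seconds

def retry_on_rpc_failure(func):
    """Retry transient RPCFailedErrors with full-jitter exponential backoff."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(MAX_ATTEMPTS):
            try:
                return func(*args, **kwargs)
            except apiproxy_errors.RPCFailedError:
                if attempt == MAX_ATTEMPTS - 1:
                    raise
                # Full jitter: sleep a uniform random amount between 0 and the
                # capped exponential backoff for this attempt.
                cap = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
                time.sleep(random.uniform(0, cap))
    return wrapper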

Something I was curious about at the time is the GIL-less (free-threaded) Python work that would allow "true" multithreading - I don't know if this library will ever be updated to support that, but if it happens it could be something to try in the future, although with small instances / few CPU cores it's hard to know whether it would be better or worse.

codebykat commented 1 month ago

I noticed today that the other App Engine APIs recently received an update to allow upgrading protobuf to v5, and also that urllib3 has a v2 which includes a lot of changes around retry behavior. (See #121.) I have no idea if upgrading either of these dependencies would help, but if this library ever follows suit, I'll be curious to try them and see if we get fewer mysterious RPCFailed errors.