BlockingIOError on App Engine after some time

drazzib commented 4 years ago

After adding google-cloud-profiler==1.1.1 to my Python 3.7 App Engine project, I'm getting lots of

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

in my service stderr logs

I'm having the same issue either with App Engine Standard python3.7 or with App Engine Flex over a custom runtime (Docker image)

Is this a known issue, is there something I can do to mitigate this issue ?

kalyanac commented 4 years ago

@drazzib Thank you for reporting this. This is not a known issue. We will try to reproduce this issue.

kalyanac commented 4 years ago

Can you share your app.yaml, requirements.txt, Dockerfile? I tried with Quickstart apps on Profiling Python applications | Cloud Profiler | Google Cloud and did not see that messages.

drazzib commented 4 years ago

app.yaml:

runtime: custom
env: flex
manual_scaling:
  instances: 1
resources:
  cpu: 4
  memory_gb: 6
  disk_size_gb: 20
env_variables:
  FLASK_APP: main.py
  FLASK_ENV: production

requirements.txt:

wrapt==1.11.2
sklearn==0.0
Flask==1.1.1
gunicorn==20.0.4
future==0.18.2
h5py==2.10.0
google-cloud-storage==1.26.0
google-cloud-core==1.3.0
google-cloud-profiler==1.1.1
google-cloud-logging==1.15.0
pandas==1.0.2
datalab==1.1.5
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
tensorflow-cpu==2.1.0
cachetools==4.1.0

Dockerfile:

FROM gcr.io/google-appengine/python

# Create a virtualenv for dependencies. This isolates these packages from
# system-level packages.
# Use -p python3 or -p python3.7 to select python version. Default is version 2.
RUN virtualenv -p python3 /env

# Setting these environment variables are the same as running
# source /env/bin/activate.
ENV VIRTUAL_ENV /env
ENV PATH /env/bin:$PATH

# Copy the application's requirements.txt and run pip to install all
# dependencies into the virtualenv.
ADD requirements.txt /app/requirements.txt
RUN pip install --upgrade pip
RUN pip install -r /app/requirements.txt

# Add the application source code.
ADD . /app

# Run a WSGI server to serve the application. gunicorn must be declared as
# a dependency in requirements.txt.
CMD gunicorn -c gunicorn.conf.py -b :$PORT main:app

drazzib commented 4 years ago

Hi @kalyanac & @jqll !

I've just seen that a new 1.1.2 release has been pushed to PyPI repository which contains 17d58aa3db8fd20844f3955ba62361c512b5592b commit.

Is this relevant for issue discussed above ?

kalyanac commented 4 years ago

Hi @drazzib, No the new release is not related to this issue. We are still having trouble reproducing this error.

kalyanac commented 4 years ago

Are you seeing any profiles collected despite these messages?

drazzib commented 4 years ago

Yes, I can confirm that - even with this error printed very often on stderr there is still some collected profile visible on Google StackDriver Profiler UI

kalyanac commented 4 years ago

Thank you for the response. Few more questions:

What profile types do you see in the Profiler UI?
For each profile type, select a 10 minutes in Timespan, and check the profile count shown in the root node. The root node has the profile count displayed as "... averaged over n profiles". Please report what the n is.
If the service is not running currently, please set the end time to a time when the service was running by clicking "Now" and setting the end time.

kalyanac commented 4 years ago

Gentle ping.

drazzib commented 4 years ago

Hi @kalyanac , sorry for the delay.

In Profiler UI, I see both "CPU Time" et "Wall time" for my application
For each profile type, under a 10 minutes in Timespan, I can see
- for "CPU Time" 20.91 ms (100%), averaged over 11 profiles
- for "Wall Time" 10 s (100%), averaged over 9 profiles

kalyanac commented 4 years ago

Hi @drazzib, it seems like the agent is uploading profiles at a rate we expect it to. So we don't anticipate any loss of profile data at the moment.

We'll try to reproduce this issue and root cause the source of the message.

radyz commented 4 years ago

I just started having this same issue today with my CloudRun service

Libs: google-cloud-profiler = 1.1.2 gunicorn = 20.0.4 Django = 3

 Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/googlecloudprofiler/pythonprofiler.py", line 180, in _handler
    if not self._started or self._in_handler:
BlockingIOError: [Errno 11] Resource temporarily unavailable"

kalyanac commented 4 years ago

@radyz thanks for reporting it.

jqll commented 4 years ago

Thanks for reporting the issue.

I think the problem is that the app enabled signal wakeup fd somewhere. When the profiler profiles, it triggers a SIGALRM/SIGPROF signal every 10ms. Since too many signals arrive too quickly, it may cause the fd’s buffer to be full, see https://docs.python.org/3/library/signal.html#signal.set_wakeup_fd:

There are two common ways to use this function. In both approaches, you use the fd to wake up when a signal arrives, but then they differ in how they determine which signal or signals have arrived.

In the first approach, we read the data out of the fd’s buffer, and the byte values give you the signal numbers. This is simple, but in rare cases it can run into a problem: generally the fd will have a limited amount of buffer space, and if too many signals arrive too quickly, then the buffer may become full, and some signals may be lost. If you use this approach, then you should set warn_on_full_buffer=True, which will at least cause a warning to be printed to stderr when signals are lost.

In the second approach, we use the wakeup fd only for wakeups, and ignore the actual byte values. In this case, all we care about is whether the fd’s buffer is empty or non-empty; a full buffer doesn’t indicate a problem at all. If you use this approach, then you should set warn_on_full_buffer=False, so that your users are not confused by spurious warning messages.

This will not affect the profiler. But if the app definitely needs to access the signal numbers as described in the first approach above, you probably should not use the profiler.

mgrdcm commented 4 years ago

I'm also having this same thing happen and have for years. GAE Standard, both Python 3.7 and Python 3.8, Django 3.0.8, default Gunicorn setup. Also tried a while ago to use uwsgi and same thing happens.

I also notice that I don't get very useful information from within my python app inside of Profiler, which I wonder if is related.

kalyanac commented 4 years ago

@mgrdcm Can you open a new issue about not getting useful information from Profiler? We would like to understand what is happening and help address it.

mgrdcm commented 4 years ago

Something on this that I've only noticed in the last 24 hours on GAE Standard is this:

Traceback (most recent call last): File "/layers/google.python.pip/pip/googlecloudprofiler/pythonprofiler.py", line 165, in _handler def _handler(self, unused_signum, frame): BlockingIOError: [Errno 11] Resource temporarily unavailable

Previously I was only getting the same block as mentioned in the more generic errors in the original post:

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

jqll commented 4 years ago

Hi @mgrdcm, sorry for the late response. When you see the new error below, does the profile you get contain "unknown" frames? If that's the case, could you share some setup so that we can reproduce the problem? Thanks!

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

mgrdcm commented 3 years ago

@jqll I apologies, I missed this message until now. I stopped using cloud profiler due to this and a few other reasons. Has anything changed on this issue since September? If it hasn't been diagnosed or resolved I'm happy to add it back in and see if I'm still seeing the problem!

kalyanac commented 3 years ago

We have documented this in our known issues section of the documentation.

Closing because there is no further action expected on this issue.

mgrdcm commented 3 years ago

@kalyanac I'm running a pretty basic Django app on GAE Standard, and I just searched through all of my code and all the dependencies and nothing is calling set_wakeup_fd.

kalyanac commented 3 years ago

It is probably set by one of the Django or any other imports Django uses.

kalyanac commented 3 years ago

Here is an example from gunicorn

mgrdcm commented 3 years ago

Heh, I was just arriving at that conclusion - it's gunicorn itself that's doing it, which is the default server that Google App Engine Standard uses! None of the rest of the code I'm using in my project (directly or indirectly) does it.

So it looks like right now there's no way to use google-cloud-profiler reliably on App Engine Standard. Unless we switch to a different non-default server?

mgrdcm commented 3 years ago

@kalyanac Because of ☝️ meaning that you can't use one Google product (this package) on another (Google App Engine Standard with Python 3) without this happening, it kinda feels like a bug that should still be open. Maybe GAE offers a patched version of gunicorn? Or this package has a workaround for this case?

kalyanac commented 3 years ago

@mgrdcm This is not specific to gunicorn, any package or user code could set the same. I appreciate your point regarding App Engine. We will post an update if we figure out a way to address this.

zackarydev commented 3 years ago

Has anyone discovered a temporary fix for this other than disabling the profiler?

wyk9787 commented 3 years ago

Has anyone discovered a temporary fix for this other than disabling the profiler?

You can find the current solution here: https://cloud.google.com/profiler/docs/troubleshooting#python-blocking. I will also update the README to mention this.

mgrdcm commented 3 years ago

Not a fix, but I switched to using the New Relic agent on GAE Standard Python 3 and it works wonderfully. Their new free tier is pretty generous, and so far I'm able to stay within it for my use case.

GoogleCloudPlatform / cloud-profiler-python

BlockingIOError on App Engine after some time #50