Closed drazzib closed 3 years ago
@drazzib Thank you for reporting this. This is not a known issue. We will try to reproduce this issue.
Can you share your app.yaml
, requirements.txt
, Dockerfile
? I tried with Quickstart apps on Profiling Python applications | Cloud Profiler | Google Cloud and did not see that messages.
app.yaml
:
runtime: custom
env: flex
manual_scaling:
instances: 1
resources:
cpu: 4
memory_gb: 6
disk_size_gb: 20
env_variables:
FLASK_APP: main.py
FLASK_ENV: production
requirements.txt
:
wrapt==1.11.2
sklearn==0.0
Flask==1.1.1
gunicorn==20.0.4
future==0.18.2
h5py==2.10.0
google-cloud-storage==1.26.0
google-cloud-core==1.3.0
google-cloud-profiler==1.1.1
google-cloud-logging==1.15.0
pandas==1.0.2
datalab==1.1.5
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
tensorflow-cpu==2.1.0
cachetools==4.1.0
Dockerfile
:
FROM gcr.io/google-appengine/python
# Create a virtualenv for dependencies. This isolates these packages from
# system-level packages.
# Use -p python3 or -p python3.7 to select python version. Default is version 2.
RUN virtualenv -p python3 /env
# Setting these environment variables are the same as running
# source /env/bin/activate.
ENV VIRTUAL_ENV /env
ENV PATH /env/bin:$PATH
# Copy the application's requirements.txt and run pip to install all
# dependencies into the virtualenv.
ADD requirements.txt /app/requirements.txt
RUN pip install --upgrade pip
RUN pip install -r /app/requirements.txt
# Add the application source code.
ADD . /app
# Run a WSGI server to serve the application. gunicorn must be declared as
# a dependency in requirements.txt.
CMD gunicorn -c gunicorn.conf.py -b :$PORT main:app
Hi @kalyanac & @jqll !
I've just seen that a new 1.1.2 release has been pushed to PyPI repository which contains 17d58aa3db8fd20844f3955ba62361c512b5592b commit.
Is this relevant for issue discussed above ?
Hi @drazzib, No the new release is not related to this issue. We are still having trouble reproducing this error.
Are you seeing any profiles collected despite these messages?
Yes, I can confirm that - even with this error printed very often on stderr there is still some collected profile visible on Google StackDriver Profiler UI
Thank you for the response. Few more questions:
Gentle ping.
Hi @kalyanac , sorry for the delay.
20.91 ms (100%), averaged over 11 profiles
10 s (100%), averaged over 9 profiles
Hi @drazzib, it seems like the agent is uploading profiles at a rate we expect it to. So we don't anticipate any loss of profile data at the moment.
We'll try to reproduce this issue and root cause the source of the message.
I just started having this same issue today with my CloudRun service
Libs: google-cloud-profiler = 1.1.2 gunicorn = 20.0.4 Django = 3
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/googlecloudprofiler/pythonprofiler.py", line 180, in _handler
if not self._started or self._in_handler:
BlockingIOError: [Errno 11] Resource temporarily unavailable"
@radyz thanks for reporting it.
Thanks for reporting the issue.
I think the problem is that the app enabled signal wakeup fd somewhere. When the profiler profiles, it triggers a SIGALRM/SIGPROF signal every 10ms. Since too many signals arrive too quickly, it may cause the fd’s buffer to be full, see https://docs.python.org/3/library/signal.html#signal.set_wakeup_fd:
There are two common ways to use this function. In both approaches, you use the fd to wake up when a signal arrives, but then they differ in how they determine which signal or signals have arrived.
In the first approach, we read the data out of the fd’s buffer, and the byte values give you the signal numbers. This is simple, but in rare cases it can run into a problem: generally the fd will have a limited amount of buffer space, and if too many signals arrive too quickly, then the buffer may become full, and some signals may be lost. If you use this approach, then you should set warn_on_full_buffer=True, which will at least cause a warning to be printed to stderr when signals are lost.
In the second approach, we use the wakeup fd only for wakeups, and ignore the actual byte values. In this case, all we care about is whether the fd’s buffer is empty or non-empty; a full buffer doesn’t indicate a problem at all. If you use this approach, then you should set warn_on_full_buffer=False, so that your users are not confused by spurious warning messages.
This will not affect the profiler. But if the app definitely needs to access the signal numbers as described in the first approach above, you probably should not use the profiler.
I'm also having this same thing happen and have for years. GAE Standard, both Python 3.7 and Python 3.8, Django 3.0.8, default Gunicorn setup. Also tried a while ago to use uwsgi and same thing happens.
I also notice that I don't get very useful information from within my python app inside of Profiler, which I wonder if is related.
@mgrdcm Can you open a new issue about not getting useful information from Profiler? We would like to understand what is happening and help address it.
Something on this that I've only noticed in the last 24 hours on GAE Standard is this:
Traceback (most recent call last): File "/layers/google.python.pip/pip/googlecloudprofiler/pythonprofiler.py", line 165, in _handler def _handler(self, unused_signum, frame): BlockingIOError: [Errno 11] Resource temporarily unavailable
Previously I was only getting the same block as mentioned in the more generic errors in the original post:
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
Hi @mgrdcm, sorry for the late response. When you see the new error below, does the profile you get contain "unknown" frames? If that's the case, could you share some setup so that we can reproduce the problem? Thanks!
BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:
@jqll I apologies, I missed this message until now. I stopped using cloud profiler due to this and a few other reasons. Has anything changed on this issue since September? If it hasn't been diagnosed or resolved I'm happy to add it back in and see if I'm still seeing the problem!
We have documented this in our known issues section of the documentation.
Closing because there is no further action expected on this issue.
@kalyanac I'm running a pretty basic Django app on GAE Standard, and I just searched through all of my code and all the dependencies and nothing is calling set_wakeup_fd
.
It is probably set by one of the Django or any other imports Django uses.
Heh, I was just arriving at that conclusion - it's gunicorn
itself that's doing it, which is the default server that Google App Engine Standard uses! None of the rest of the code I'm using in my project (directly or indirectly) does it.
So it looks like right now there's no way to use google-cloud-profiler
reliably on App Engine Standard. Unless we switch to a different non-default server?
@kalyanac Because of ☝️ meaning that you can't use one Google product (this package) on another (Google App Engine Standard with Python 3) without this happening, it kinda feels like a bug that should still be open. Maybe GAE offers a patched version of gunicorn
? Or this package has a workaround for this case?
@mgrdcm This is not specific to gunicorn
, any package or user code could set the same.
I appreciate your point regarding App Engine. We will post an update if we figure out a way to address this.
Has anyone discovered a temporary fix for this other than disabling the profiler?
Has anyone discovered a temporary fix for this other than disabling the profiler?
You can find the current solution here: https://cloud.google.com/profiler/docs/troubleshooting#python-blocking. I will also update the README to mention this.
Not a fix, but I switched to using the New Relic agent on GAE Standard Python 3 and it works wonderfully. Their new free tier is pretty generous, and so far I'm able to stay within it for my use case.
After adding
google-cloud-profiler==1.1.1
to my Python 3.7 App Engine project, I'm getting lots ofin my service stderr logs
I'm having the same issue either with App Engine Standard python3.7 or with App Engine Flex over a custom runtime (Docker image)
Is this a known issue, is there something I can do to mitigate this issue ?