Open zoola969 opened 1 year ago
Hi @zoola969
Just to confirm, are you sure this is passed through to the end user or is that a suspicion based on logs? That particular log message is handled a bit special, as it does print the stack trace to the logs, which can be confused with an actual exception.
Specifically, in the case of an unexpected HTTP status, we raise a TransportException
with print_trace=True
here (that's also the originating line in the stack trace in your logs):
This exception is then caught here:
and
As you can see, we emit a log message with level error
, but don't let the exception bubble up.
Yes, I'm sure. This is a complaint from a user: As you can see, this is the same as the message from the apm server. There are no error messages and errors in app logs as well. I have seen a source code of apm agent, where the exception should be caught, but something goes wrong and I get errors. My mind was completely blown, that's why I called this issue very strange
Also, I have a sentry integration and it didn't capture any errors. But in access logs I see 500 responses when apm agent fails
I've also looked through the code paths for this issue, and as far as I can see, your user's error report is likely a coincidence. As @beniwohli pointed out, we don't raise that exception.
I don't know how to take this issue further without a reproduction case, or at least a theory as to how this could happen. 🤔
This is not confidence. It happens several times on several apps, connected to elk. In all cases there is one scenario: elk server failures, 500-code responses in app/nginx access logs, and only TransportException errors in logs. I'll try to investigate and find how to reproduce it soon, for now I just turned off apm agent
Alright, I guess we need to clarify some more terms as well. What version of Elasticsearch are you running? Is your APM Server the same version?
When you say "elk server failures" do you mean that the APM Server is failing? I'm surprised you're getting 500 errors. If the APM Server were just overloaded, I would expect you to be getting 429
errors, see our handling code here:
What are the APM Server logs showing? I'm very surprised you're getting 500 errors in the first place.
Also, are you using the python Elasticsearch client library? Perhaps that could be why you're seeing similar errors in userspace -- perhaps Elasticsearch is giving 500 errors to your requests, which you're exposing to the user? That could explain the timing coincidence.
Hi @basepi We have elk server version 8.3.1 I found some strange log records around the incident time
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][info] fail enroll
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][error] perform rollback on enrollment failure
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][info] fail enroll
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][error] perform rollback on enrollment failure
Also, there are some access logs in nginx before the amp server:
POST /config/v1/agents HTTP/1.1" 500 177 "-" "apm-agent-python/6.15.1
POST /intake/v2/events HTTP/1.1" 500 177 "-" "apm-agent-python/6.15.1
I found some strange log records around the incident time
Are you using elastic agent/fleet server to manage your APM server?
Also, there are some access logs in nginx before the amp server:
So it nginx returning the 500 or is nginx just reporting that the APM server reported a 500? I'm not certain what those logs are telling me.
Describe the bug: I was faced with a very strange and weird issue. When ELK server fails and responds with 500 error, a current HTTP request fails too and my server responds with 500 status and plain text from ELK response. And this error doesn't pass through any middleware, I have a couple of them, one for sentry integration and another for error message transforming, and none of them was triggered.
This is a full logs from my app:
To Reproduce
Unfortunately, I can't reproduce it in my dev configuration. It happens only in production environment. But I guess it happens like that:
Environment (please complete the following information)
Additional context
Agent config options
Click to expand
``` APM_ENABLED: 'True' APM_DEBUG: 'False' APM_SERVICE_NAME: "project" APM_ENVIRONMENT: "production" APM_CLOUD_PROVIDER: "aws" APM_TRANSACTION_IGNORE_URLS: "[]" APM_SERVER_URL: "https://some.host" APM_SERVER_TIMEOUT: "10s" APM_SAMPLE_RATE: "1.0" APM_LOG_LEVEL: "info" APM_TRANSPORT_CLASS: "elasticapm.transport.http.Transport" ```requirements.txt
:Click to expand
``` alembic==1.9.4 anyio==3.6.2 asyncpg==0.27.0 authlib==1.2.0 certifi==2022.12.7 cffi==1.15.1 click==8.1.3 cryptography==39.0.2 dnspython==2.3.0 elastic-apm==6.15.1 email-validator==1.3.1 fastapi==0.92.0 greenlet==2.0.2 h11==0.14.0 httpcore==0.16.3 httpx==0.23.3 idna==3.4 jinja2==3.1.2 mako==1.2.4 markupsafe==2.1.2 prometheus-client==0.16.0 prometheus-fastapi-instrumentator==5.10.0 psycopg==3.1.8 pycparser==2.21 pydantic[email]==1.10.5 pyjwt==2.6.0 pyotp==2.8.0 python-multipart==0.0.6 rfc3986[idna2008]==1.5.0 sentry-sdk==1.16.0 sniffio==1.3.0 sqlalchemy==2.0.4 starlette==0.25.0 typing-extensions==4.5.0 urllib3==1.26.14 uvicorn==0.20.0 wrapt==1.15.0 ```