elastic / apm-agent-python

https://www.elastic.co/guide/en/apm/agent/python/current/index.html
BSD 3-Clause "New" or "Revised" License
412 stars 219 forks source link

Error during to sending traces data to server somehow causes python server failure #1804

Open zoola969 opened 1 year ago

zoola969 commented 1 year ago

Describe the bug: I was faced with a very strange and weird issue. When ELK server fails and responds with 500 error, a current HTTP request fails too and my server responds with 500 status and plain text from ELK response. And this error doesn't pass through any middleware, I have a couple of them, one for sentry integration and another for error message transforming, and none of them was triggered.

This is a full logs from my app:

Failed to submit message: 'HTTP 500: <html>\r\n<head><title>500 Internal Server Error</title></head>\r\n<body>\r\n<center><h1>500 Internal Server Error</h1></center>\r\n<hr><center>nginx/1.20.1</center>\r\n</body>\r\n</html>\r\n'
 Traceback (most recent call last):
   File "/usr/local/lib/python3.11/site-packages/elasticapm/transport/base.py", line 279, in _flush
    self.send(data, forced_flush=forced_flush)
   File "/usr/local/lib/python3.11/site-packages/elasticapm/transport/http.py", line 114, in send
    raise TransportException(message, data, print_trace=print_trace)
 elasticapm.transport.exceptions.TransportException: HTTP 500: <html>
 <head><title>500 Internal Server Error</title></head>
 <body>
 <center><h1>500 Internal Server Error</h1></center>
 <hr><center>nginx/1.20.1</center>
 </body>
 </html>

To Reproduce

Unfortunately, I can't reproduce it in my dev configuration. It happens only in production environment. But I guess it happens like that:

  1. A python server receives HTTP request from a user for example on root path "/"
  2. Around the same time the elk-apm lib tries to send some data to elk server
  3. Server responds with 500 error
  4. ???
  5. Somehow this error passed through to end user, and the user sees 500 error with plain text from the original exception

Environment (please complete the following information)

Additional context

beniwohli commented 1 year ago

Hi @zoola969

Just to confirm, are you sure this is passed through to the end user or is that a suspicion based on logs? That particular log message is handled a bit special, as it does print the stack trace to the logs, which can be confused with an actual exception.

Specifically, in the case of an unexpected HTTP status, we raise a TransportException with print_trace=True here (that's also the originating line in the stack trace in your logs):

https://github.com/elastic/apm-agent-python/blob/d19c5ff5c0de0dc526d59e7e044194da37f36f2b/elasticapm/transport/http.py#L114

This exception is then caught here:

https://github.com/elastic/apm-agent-python/blob/d19c5ff5c0de0dc526d59e7e044194da37f36f2b/elasticapm/transport/base.py#L186-L187

and

https://github.com/elastic/apm-agent-python/blob/d19c5ff5c0de0dc526d59e7e044194da37f36f2b/elasticapm/transport/base.py#L338

As you can see, we emit a log message with level error, but don't let the exception bubble up.

zoola969 commented 1 year ago

Yes, I'm sure. This is a complaint from a user: image As you can see, this is the same as the message from the apm server. There are no error messages and errors in app logs as well. I have seen a source code of apm agent, where the exception should be caught, but something goes wrong and I get errors. My mind was completely blown, that's why I called this issue very strange

zoola969 commented 1 year ago

Also, I have a sentry integration and it didn't capture any errors. But in access logs I see 500 responses when apm agent fails

basepi commented 1 year ago

I've also looked through the code paths for this issue, and as far as I can see, your user's error report is likely a coincidence. As @beniwohli pointed out, we don't raise that exception.

I don't know how to take this issue further without a reproduction case, or at least a theory as to how this could happen. 🤔

zoola969 commented 1 year ago

This is not confidence. It happens several times on several apps, connected to elk. In all cases there is one scenario: elk server failures, 500-code responses in app/nginx access logs, and only TransportException errors in logs. I'll try to investigate and find how to reproduce it soon, for now I just turned off apm agent

basepi commented 1 year ago

Alright, I guess we need to clarify some more terms as well. What version of Elasticsearch are you running? Is your APM Server the same version?

When you say "elk server failures" do you mean that the APM Server is failing? I'm surprised you're getting 500 errors. If the APM Server were just overloaded, I would expect you to be getting 429 errors, see our handling code here:

https://github.com/elastic/apm-agent-python/blob/25eb517ee95607f940ea99c729db6cc4df00fe62/elasticapm/transport/http.py#L109-L117

What are the APM Server logs showing? I'm very surprised you're getting 500 errors in the first place.

Also, are you using the python Elasticsearch client library? Perhaps that could be why you're seeing similar errors in userspace -- perhaps Elasticsearch is giving 500 errors to your requests, which you're exposing to the user? That could explain the timing coincidence.

zoola969 commented 1 year ago

Hi @basepi We have elk server version 8.3.1 I found some strange log records around the incident time

07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][info] fail enroll  
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][error] perform rollback on enrollment failure  
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][info] fail enroll  
07:50:35.279 elastic_agent.fleet_server [elastic_agent.fleet_server][error] perform rollback on enrollment failure  

Also, there are some access logs in nginx before the amp server:

POST /config/v1/agents HTTP/1.1" 500 177 "-" "apm-agent-python/6.15.1
POST /intake/v2/events HTTP/1.1" 500 177 "-" "apm-agent-python/6.15.1
basepi commented 1 year ago

I found some strange log records around the incident time

Are you using elastic agent/fleet server to manage your APM server?

Also, there are some access logs in nginx before the amp server:

So it nginx returning the 500 or is nginx just reporting that the APM server reported a 500? I'm not certain what those logs are telling me.