elastic / apm-agent-ruby


APM server sometimes responds with 'gzip: invalid header' errors #1201

acrewdson opened this issue 2 years ago (status: Open)

acrewdson commented 2 years ago

Describe the bug

We're sometimes seeing the following warning logged by the Ruby APM agent:

[ElasticAPM] APM Server responded with an error:
"{\"accepted\":2,\"errors\":[{\"message\":\"gzip: invalid header\"}]}\n"

Looking in the APM Server logs, I see corresponding 500s for the /intake/v2/events endpoint with the same error message.

Steps to reproduce

No clear repro steps apart from having the http_compression config option enabled. When I disable that option, I don't see the error.
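
For reference, a minimal sketch of that workaround, assuming the agent is started programmatically (in a Rails app the same option can go in config/elastic_apm.yml, or be set via the ELASTIC_APM_HTTP_COMPRESSION environment variable):

require 'elastic_apm'

# Disable request compression so no gzip headers or payloads are involved.
ElasticAPM.start(http_compression: false)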

Expected behavior

No errors returned by the APM Server. We're definitely seeing events make it to the APM Server, but it's not clear whether these errors mean we're sometimes dropping events.

Environment

Additional context

One guess, based on this test case, is that the Ruby agent is sometimes sending requests to the intake endpoint that include the 'Content-Encoding' => 'gzip' header but whose payload isn't actually compressed. That's what the test case demonstrates can return the gzip: invalid header response.
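
As a hedged illustration of that guess: a request that advertises gzip but carries a plain body should draw the same response from the intake endpoint. The URL assumes a locally running APM Server on the default port.

require 'net/http'

# Claim gzip in the headers but send an uncompressed body; the server's
# gzip reader fails on the first bytes and reports 'gzip: invalid header'.
uri = URI('http://localhost:8200/intake/v2/events')
request = Net::HTTP::Post.new(uri)
request['Content-Type'] = 'application/x-ndjson'
request['Content-Encoding'] = 'gzip'
request.body = '{"metadata":{"service":{"name":"repro"}}}'

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts response.code, response.body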

jclusso commented 2 years ago

I've noticed this popping up in our logs a lot lately, I believe mostly since upgrading to 4.4.0.

estolfo commented 2 years ago

This issue is due to a limitation of using IO.pipe and forking processes. It's not clear what the best course of action is at this point.

The agent uses an IO pipe to send data to the APM Server. When the parent process is forked, the child inherits the agent's readers and (gzip) writers. We create new readers and writers in the child process so that they don't interfere with the parent's work. When the fork exits, we close the child's readers and writers, but we see the following error

zlib(finalizer): Zlib::GzipWriter object must be closed explicitly.
zlib(finalizer): the stream was freed prematurely.

This happens because the parent's writer is still considered alive and unclosed in the child process (see #356).
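
Here's a minimal sketch (not the agent's actual code) of that inheritance problem. It needs a platform with fork, e.g. Linux or macOS.

require 'zlib'

read_io, write_io = IO.pipe
gzip = Zlib::GzipWriter.new(write_io)
gzip.write('parent event data') # compression is in flight when we fork

pid = fork do
  # The child inherits `gzip` but must not close it: the stream is shared
  # with the parent's in-flight compression. Leaving it open means that at
  # child exit the zlib finalizer prints the warnings quoted above.
end
Process.wait(pid)

# The parent can still finish its stream cleanly.
gzip.close
read_io.close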

One option to address this error is to create a finalizer on the writer that closes it (see #787). But then we get intermittent errors from the APM Server saying the gzipped data's header is invalid, most likely because the parent's inherited writer is closed by the child before the parent has finished compressing.
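
A hedged sketch of that idea (the actual change is in #787; the method name here is illustrative). The key detail is that Zlib::GzipWriter#close writes the gzip trailer to the underlying IO, so a finalizer firing in a forked child pushes trailer bytes into the pipe the parent is still streaming through:

# Illustrative only; not the agent's real code.
def register_close_finalizer(writer)
  # The proc captures `writer`, so the writer stays alive until process
  # exit, where MRI runs all remaining finalizers. In a forked child that
  # means the inherited writer gets closed, and #close writes the gzip
  # trailer into the pipe shared with the parent.
  ObjectSpace.define_finalizer(writer, proc { writer.close unless writer.closed? })
end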

I've opened issues with Puma and Resque to ask if they have advice, as they are popular frameworks that use forking. Neither project had a suggestion for how to handle this scenario. See the issue opened with Puma and the issue opened with Resque.

I've also opened a ticket with the Ruby MRI team to ask for advice here

At this point in my research, I've identified four options:

1) Close the readers and writers in the child process right after forking and open new ones. The result is intermittent gzip: invalid header errors from the APM Server and loss of data, because closing the readers and writers interferes with the parent process's ability to fully compress and send data.

2) Don't close the readers and writers in the child process after forking, but simply create new ones. The result is the following warning when the child process exits, once for each writer the parent process had when it was forked. This might indicate a memory leak, although it's not clear what the implications are beyond generating a warning.

zlib(finalizer): Zlib::GzipWriter object must be closed explicitly.
zlib(finalizer): the stream was freed prematurely.

3) In a before_fork hook in the parent, flush the writers and close them along with the readers so that they are not inherited by the child process (see the sketch after this list). The result is unnecessarily closing writers, and GCing them along with the readers, each time the process forks. This could slow down forking and wastes a lot of objects. In a framework like Puma, which probably forks infrequently (depending on the load), this might not be a big deal. In a framework like Resque, which forks whenever there is work to be done, the drawback could be more significant.

4) Rearchitect the agent not to use a pipe.
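
For illustration, option 3 could look roughly like this in a Puma config. agent_writer and agent_reader stand in for the agent's internal pipe objects, which it does not expose under these names:

# config/puma.rb (names are illustrative, not the agent's real API)
before_fork do
  agent_writer.close # GzipWriter#close flushes the gzip trailer and closes the pipe's write end
  agent_reader.close # close the read end too, so the child inherits no open pipe
end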

mikker commented 2 years ago

Amazed that you figured this out, @estolfo.

acrewdson commented 2 years ago

@estolfo huge thanks for the great investigation and write-up here 👍

jclusso commented 2 years ago

Any update on this being fixed? Our logs are getting filled up with this all the time.

estolfo commented 2 years ago

Hi @jclusso, this isn't really a bug that can be fixed. It's more a limitation of the Ruby language, and I'm not sure there's anything we can do about it. I'm hoping to hear back from the Ruby core team at some point on my issue, but it has been a while since I opened it and I'm not sure I'll get a response. We are going to take a look at this again in the near future and will update this issue if we find a different approach.

jclusso commented 2 years ago

@estolfo is this something that is impacting us? If not, can we stop this from being logged without disabling all error logging?

estolfo commented 2 years ago

@jclusso There's no way to selectively disable this log message while keeping the other log messages, unfortunately. The message does indicate that some data was not correctly compressed and sent to the APM Server. The alternative is to ensure the data isn't corrupted, but then you'll see the error messages from https://github.com/elastic/apm-agent-ruby/issues/356 in your logs instead. As I mentioned, we are going to look at this again in the next few weeks, so I'll update here if we find an approach that doesn't force us to choose between two options that both produce errors.