acrewdson opened this issue 2 years ago
I've noticed this popping up in our logs a lot lately, I believe mostly since upgrading to 4.4.0.
This issue is due to a limitation of using `IO.pipe` and forking processes. It's not clear what the best course of action is at this point.
The agent uses an IO pipe to send data to the APM server. When the parent process is forked, the child inherits the agent's `reader`s and (gzip) `writer`s. We create new `reader`s and `writer`s in the child process so that they don't interfere with the parent's work.
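To make the setup concrete, here is a minimal sketch of that pattern using plain `IO.pipe` and `fork` (illustrative names, not the agent's actual internals):

```ruby
require "zlib"

# The parent owns a pipe plus a GzipWriter; a forked child inherits both
# and builds replacements of its own.
read_io, write_io = IO.pipe
gzip_writer = Zlib::GzipWriter.new(write_io)

pid = fork do
  # The child inherited read_io, write_io and gzip_writer. It creates a
  # fresh pipe and writer so it doesn't touch the parent's stream.
  child_read, child_write = IO.pipe
  child_gzip = Zlib::GzipWriter.new(child_write)
  child_gzip.write("data from the child")
  child_gzip.close
  child_read.close
  # The inherited gzip_writer is still open in this process; when the
  # child exits, zlib's finalizer prints the warnings quoted below.
end

gzip_writer.write("data from the parent")
Process.wait(pid)
gzip_writer.close
read_io.close
```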
When the fork exits, we close the child's `reader`s and `writer`s, but we see the following error
zlib(finalizer): Zlib::GzipWriter object must be closed explicitly.
zlib(finalizer): the stream was freed prematurely.
(see #356)
because the parent's `writer` is still considered alive and unclosed in the child process.
One option to address this error is to create a finalizer on the writer that closes it (see #787). But then we get intermittent errors from the APM server saying the gzipped data's header is invalid. This is most likely because the (parent's) writer is closed by the child before the compression is complete in the parent.
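As a standalone illustration of that failure mode (plain Ruby, not agent code; the wording of the server-side error may differ), decoding a gzip stream that was captured before its trailer was written fails:

```ruby
require "zlib"
require "stringio"

# Compress into an in-memory buffer, but grab the bytes before the writer
# is closed, i.e. before the gzip trailer has been written.
buffer = StringIO.new
gzip_writer = Zlib::GzipWriter.new(buffer)
gzip_writer.write("some span data")
gzip_writer.flush
truncated = buffer.string.dup
gzip_writer.close

# Decoding the truncated bytes fails, much like the APM server rejecting
# a payload whose compression was never completed.
begin
  Zlib::GzipReader.new(StringIO.new(truncated)).read
rescue Zlib::Error => e
  puts e.message # e.g. "footer is not found"
end
```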
I've opened issues with Puma and Resque to ask if they have advice, as they are popular frameworks that use forking. Neither project had a suggestion for how to handle this scenario. See the issue opened with Puma and the issue opened with Resque.
I've also opened a ticket with the Ruby MRI team to ask for advice here.
At this point in my research, I've identified four options:
1) Close the readers and writers in the child process right after forking and open new ones. The result of choosing this option is intermittent `gzip: invalid header` errors from the APM server and loss of data, because closing the readers and writers interferes with the parent process's ability to fully compress and send data.
2) Don't close the readers and writers in the child process after forking but simply create new ones. The result of choosing this option is the following warning when the child process exits, once for each writer the parent process had when it was forked. This might indicate a memory leak, although it's not clear what the implications are beyond generating a warning.
zlib(finalizer): Zlib::GzipWriter object must be closed explicitly.
zlib(finalizer): the stream was freed prematurely.
3) In a `before_fork` hook in the parent, flush the writers and close them along with the readers so that they are not inherited by the child process (sketched after this list). The result of choosing this option is unnecessarily closing writers, and GC-ing them along with the readers, each time the process is forked. This could slow down forking and waste a lot of objects. In a framework like Puma, which probably forks infrequently (depending on the load), this might not be a big deal. In a framework like Sidekiq or Resque, which forks whenever there is work to be done, this drawback could be more significant.
4) Rearchitect the agent to not use a pipe.
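Here is the sketch of option 3 mentioned above, using plain `fork` and illustrative names rather than the agent's real API:

```ruby
require "zlib"

# The parent finishes and closes its gzip writer and pipe *before*
# forking, so the child inherits no open zlib state.
read_io, write_io = IO.pipe
gzip_writer = Zlib::GzipWriter.new(write_io)
gzip_writer.write("buffered span data")

# "before_fork": closing the writer flushes the gzip trailer and closes
# the underlying pipe end; nothing open crosses the fork boundary.
gzip_writer.close
read_io.close

pid = fork do
  # The child builds its own pipe and writer; exiting triggers no
  # "Zlib::GzipWriter object must be closed explicitly" warning.
  r, w = IO.pipe
  gz = Zlib::GzipWriter.new(w)
  gz.write("child data")
  gz.close
  r.close
end
Process.wait(pid)

# The parent re-creates its pipe and writer to keep working, which is the
# per-fork object churn this option trades away.
read_io, write_io = IO.pipe
gzip_writer = Zlib::GzipWriter.new(write_io)
gzip_writer.write("more parent data")
gzip_writer.close
read_io.close
```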
Amazed that you figured this out, @estolfo.
@estolfo huge thanks for the great investigation and write-up here 👍
Any update on this being fixed? Our logs are getting filled up with this all the time.
Hi @jclusso, this isn't really a bug that can be fixed. It's more a limitation of the Ruby language, and I'm not sure there's anything we can do about it. I'm hoping to hear back from the Ruby core team at some point on my issue, but it has been a while since I opened it and I'm not sure I'll get a response. We are going to take a look at this again in the near future and will update this issue if we find a different approach.
@estolfo is this something that is impacting us? If not, can we stop it from being logged without disabling all error logging?
@jclusso There's no way to selectively disable this log message while keeping the other log messages, unfortunately. The message does indicate that some data was not correctly compressed and sent to the APM server. The alternative is to ensure that the data isn't corrupted, but then you'll see these error messages in your logs instead: https://github.com/elastic/apm-agent-ruby/issues/356. As I mentioned, we are going to look at this again in the next few weeks, so I'll update here if we find an approach that doesn't force us to choose between two options that both produce errors.
**Describe the bug**
We're sometimes seeing the following warning logged by the Ruby APM agent:
Looking in the APM Server logs, I see corresponding 500s for the `/intake/v2/events` endpoint with the same error message.

**Steps to reproduce**
No clear repro steps apart from having the `http_compression` config option enabled. When I disable that option, I don't see the error.
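For reference, this is roughly how that option can be toggled when starting the agent manually; a sketch only, since actual setup varies by app and framework:

```ruby
require "elastic-apm"

# Sketch: start the agent with HTTP compression disabled, which avoids
# the error above (at the cost of larger request bodies). Setting
# ELASTIC_APM_HTTP_COMPRESSION=false in the environment should have the
# same effect.
ElasticAPM.start(http_compression: false)
```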
**Expected behavior**

No errors returned by the APM server. We're definitely seeing events make it to the APM Server, but it's not clear whether these errors mean we may be dropping events sometimes.
**Environment**

**Additional context**
One guess, based on this test case, is that the Ruby agent is sometimes sending requests to the intake endpoint that include the `'Content-Encoding' => 'gzip'` header but where the payload isn't actually compressed. That seems to be what the test case demonstrates can return the `gzip: invalid header` response.
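As a quick illustration using Ruby's own zlib (the APM Server is written in Go, whose gzip package reports the same failure as `gzip: invalid header`):

```ruby
require "zlib"
require "stringio"

# Handing the gzip decoder a body that was never actually compressed
# fails immediately at the header check, consistent with the guess above.
begin
  Zlib::GzipReader.new(StringIO.new('{"metadata":{}}')).read
rescue Zlib::GzipFile::Error => e
  puts e.message # => "not in gzip format"
end
```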