elastic / apm-agent-ruby

https://www.elastic.co/solutions/apm
Apache License 2.0
168 stars 131 forks source link

transactions working but metrics are not (on Windows) "Errno::EINVAL: Invalid Argument" #733

Open Ex-Ark opened 4 years ago

Ex-Ark commented 4 years ago

Describe the bug

On my kibana, transactions are being correctly sent and received but metrics are not.

I am seeing an infinite loop of errors in my rails logs

[ElasticAPM] Couldn't establish connection to APM Server:
"#<Errno::EINVAL: Invalid argument>"
[ElasticAPM] Collecting metrics
[ElasticAPM] [THREAD:76790960]: Opening new request
[ElasticAPM] [THREAD:87894000]: Closing request with reason timeout
[ElasticAPM] Closing writer with reason timeout
[ElasticAPM] Couldn't establish connection to APM Server:
"#<Errno::EINVAL: Invalid argument>"

Once in a while I get errors like this also :

[ElasticAPM] Collecting metrics
[ElasticAPM] [THREAD:76790960]: Opening new request
[ElasticAPM] Failed fetching config: {"error":"forbidden request: Agent remote configuration is disabled. Configure the `apm-server.kibana` section in apm-server.yml to enable it. If you are using a RUM agent, you also need to configure the `apm-server.rum` section. If you are not u
sing remote configuration, you can safely ignore this error."}
, trying again in 300 seconds
[ElasticAPM] [THREAD:87894000]: Closing request with reason timeout
[ElasticAPM] Closing writer with reason timeout
[ElasticAPM] Couldn't establish connection to APM Server:
"#<Errno::EINVAL: Invalid argument>"

Steps to reproduce

apm server & elk stack

  1. git clone https://github.com/deviantony/docker-elk.git
  2. cd docker-elk
  3. docker-compose -f docker-compose.yml -f extensions/apm-server/apm-server-compose.yml up

rails app

  1. add gem 'elastic-apm' in Gemfile
  2. bundle install
  3. rails s

Expected behavior

Metrics should be sent

Environment

Additional context

when reaching http://localhost:8200 it works, and I can curl and HTTP.get it too.

my elastic_apm.yml config :

server_url: http://localhost:8200
log_level: debug

I read other similar issues, and tried things suggested like putting http_compression: false and api_request_time: 30s, or ignore pattern but I have none, with no success.

estolfo commented 4 years ago

Hi @Ex-Ark! Sorry to hear that you’ve having trouble connecting to the APM server from your Rails app. We have a few questions about your setup: You are running the Docker container on a Windows machine but Docker can be run in two flavors on Windows: purely Linux-based or Windows-based. Which way are you running it? Secondly, can you give us more information about where the rails app is running? The error is coming from the APM agent, which is running within the Rails app so we’d like to know more about that Docker image. Does it share a network with the docker-elk image? The same questions about the docker flavor for Windows applies here too. Thanks!

Ex-Ark commented 4 years ago

Yes sorry for any possible confusion, here is the detailed env :

I'm using Docker Desktop with linux containers in order to run the ELK stack (as well as the APM server).

My rails app is directly running on my windows machine, using ruby mingw32.

Everything is run on my locale machine.

My agent communicates with my APM server using port binding (host machine <=> container)


I have a strong suspicion that it is an OS problem, because after some troubleshooting and testing on my staging server (which is a debian), I managed with the same config & code to push both transactions and metrics (errors as well) only on debian.

I noticed by reading the code that the data is not directly sent in json format, but rather wrapped in IO object. Could it be coming from this IO object sent by the http client ? It's the only thing that could differ in your lib based on OS (encoding ? line ending ?).

So basically this means that right now the agent is unable to send metric if the app is run on a windows machine ?

estolfo commented 4 years ago

Hi @Ex-Ark The code that I believe is causing that error are the lines in this file trying to access /proc/* files in a Linux system. Here is an example.

From what you've said about your setup, it sounds like this code should determine that your system is not Linux and thus not collect system metrics. Is this line returning :linux in your Rails app?

Ex-Ark commented 4 years ago

Here is the platform line output

def self.platform
      @platform ||= Gem::Platform.local.os.to_sym
      puts "PLATFORM: #{@platform}"
end
# => PLATFORM: mingw32

So the intended behaviour is that the metrics should not be collected ? This is a really strange because I am able to log metric data from inside the gem by putting some debug prints before the http request is sent :

def write(str)
          debug "WRITE: #{str}"
          @wr.write(str)
          @wr.bytes_sent
end
#[ElasticAPM] WRITE: {"metadata":{"service":{"name":"MYAPP","environment":"development","version":"434da8edf0be3c5cc34e0af63d859039da3710f5","agent":{"name":"ruby","version":"3.5.0"},"framework":{"name":"Ruby on Rails","version":"6.0.2.1"},"language":{"name":"ruby","version":"2.5.7"},"runtime":{"name":"ruby","version":"2.5.7"}},"process":{"pid":812,"title":"bin/rails","argv":[]},"system":{"hostname":"MY-MACHINE","architecture":"x64","platform":"mingw32","kubernetes":{"namespace":null,"node":{"name":null},"pod":{"name":null,"uid":null}}},"labels":null}}
#[ElasticAPM] WRITE: {"metricset":{"timestamp":1583833395375509,"samples":{"ruby.gc.count":{"value":63},"ruby.threads":{"value":17},"ruby.heap.slots.live":{"value":586818},"ruby.heap.slots.free":{"value":519},"ruby.heap.allocations.total":{"value":2265281}}}}
*/

Also I haven't seen any warning Skipping metrics and Unsupported platform '#{platform}' - Disabling system metrics in my logs.

estolfo commented 4 years ago

Hi @Ex-Ark only the system metrics will be disabled, not the other ones (that you're seeing)

estolfo commented 4 years ago

Since you're already putting puts in the agent code, can you see if one of the IO.readlines in this file is reached and throws an error?

Ex-Ark commented 4 years ago

I added

puts "('/proc/stat') readlines reached!"
puts "('/proc/self/stat') readlines reached!"
puts "('/proc/memeinfo) readlines reached!"

before each IO.readlines in cpu_mem_set.rb but none got logged so I believe the linux platform check is working

Ex-Ark commented 4 years ago

While we're at the IO level, could it be due to IO.pipe in this file, as this ruby bug described here ?

Are you using offset in your IO ? I still have this 4 year old ruby bug in my irb :

p IO.read('|cmd.exe /C echo hello', 1, 1)
# Traceback (most recent call last):
#        3: from C:/Ruby25-x64/bin/irb.cmd:19:in `<main>'
#        2: from (irb):2
#        1: from (irb):2:in `read'
# Errno::EINVAL (Invalid argument)

p IO.read('|cmd.exe /C echo hello')
# "hello\n"
#=> "hello\n"
estolfo commented 4 years ago

hmm, I can't find a place where we are using an offset and that error might be raised but thank you for pointing us to that ruby bug! If there's anything else you can do to help pinpoint which line in our agent might be throwing that error, we'd really appreciate it!

Ex-Ark commented 4 years ago

Managed to get metrics to be sent ! Maybe it has something to do with how httprb/http handles IO streams on Windows , because manually parsing the stream into string, and then passing it to the http client works :

def request(method, url, body: nil, headers: nil)
          body = body.read # here
          puts "BODY: #{body}" # to be sure
          @client.send(
            method,
            url,
            body: body,
            headers: (headers ? @headers.merge(headers) : @headers).to_h,
            ssl_context: @config.ssl_context
          ).flush
end
[ElasticAPM] Collecting metrics
[ElasticAPM] [THREAD:59257540]: Opening new request
[ElasticAPM] [THREAD:57576220]: Closing request with reason timeout
[ElasticAPM] Closing writer with reason timeout
BODY: {"metadata":{"service":{"name":"MY-APP","environment":"development","version":"434da8edf0be3c5cc34e0af63d859039da3710f5","agent":{"name":"ruby","version":"3.6.0"},"framework":{"name":"Ruby on Rails","version":"6.0.2.1"},"language":{"name":"ruby","version":"2.5.7"},"ru
ntime":{"name":"ruby","version":"2.5.7"}},"process":{"pid":6844,"title":"bin/rails","argv":[]},"system":{"hostname":"MY-MACHINE","architecture":"x64","platform":"mingw32","kubernetes":{"namespace":null,"node":{"name":null},"pod":{"name":null,"uid":null}}},"labels":null}}
{"metricset":{"timestamp":1583848253220822,"samples":{"ruby.gc.count":{"value":60},"ruby.threads":{"value":18},"ruby.heap.slots.live":{"value":446030},"ruby.heap.slots.free":{"value":696},"ruby.heap.allocations.total":{"value":2295426}}}}
[ElasticAPM] APM Server responded with status 202

APM Server finally accept the metric, and I can actually see the data in my dashboard (ex: aggregation max ruby.heap.slots.free => 696 Note : I can not see it in Kibana's APM section directly because it only shows system metrics, which are not collected on Windows.

I noticed by reading the lib some kind of patch were necessary due to http.rb IO handling itself, so it's possible to think another local patch might be necessary or at least report the issue to httprb if it proves to be the same case elsewhere.

# Http.rb<4 calls rewind on the request bodies, but IO::Pipe raises
# ~mikker
return if HTTP::VERSION.to_i >= 4
def rd.rewind; end

For the sake of it I tried this solution (body.read) on debian alas it raises this :

E, [2020-03-10T14:40:34.372209 #19703] ERROR -- : [ElasticAPM] Failed converting event to JSON: <ElasticAPM::Metricset timestamp:1583847634370680 transaction:nil span:nil tags:nil samples:{:"system.cpu.total.norm.pct"=>NaN, :"system.memory.actual.free"=>5523611648, :"system.memory.total"=>8366317568, :"system.process.cpu.total.norm.pct"=>NaN, :"system.process.memory.size"=>684482560, :"system.process.memory.rss.bytes"=>113922048}>

so I don't know how to provide a fix while keeping the agent working on other platforms.

I won't dig deeper into this, I have an heterogeneous network of rails app (both on windows and debian), such a shame I'll only be able to use it on half my apps only.

Thank you for this gem and for your help

estolfo commented 4 years ago

Thank you so much for this insight! Let's keep the issue open though so that we can look into whether there's a solution that works on both platforms.