cloudfoundry / loggregator-release

Cloud Native Logging
Apache License 2.0

metron_agent startup failures #169

Closed. ChrisMcGowan closed this issue 7 years ago.

ChrisMcGowan commented 7 years ago

Running into issues with metron_agent on Diego cells, on CF v243 and Diego 0.1485.0.

On initial deployment of the cells, monit and BOSH both report that metron_agent is running, but smoke_tests fail due to missing app logs. Looking at the metron_agent logs (/var/vcap/sys/log/metron_agent) on the cells, the following messages appear:

metron_agent.stdout.log:

Start of log at VM deployment:
{"timestamp":1478789343.195786953,"process_id":7992,"source":"metron","log_level":"info","message":"Startup: Setting up the Metron agent","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":77,"method":"main.main"}
{"timestamp":1478789343.201402187,"process_id":7992,"source":"metron","log_level":"warn","message":"Failed to connect to etcd","data":{"error":"sync cluster failed"},"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":122,"method":"main.adapter"}
{"timestamp":1478789343.329437256,"process_id":7992,"source":"metron","log_level":"error","message":"Finder: Watch sent an error","data":{"error":"store request timed out"},"file":"/var/vcap/data/compile/metron_agent/loggregator/src/doppler/dopplerservice/finder.go","line":201,"method":"doppler/dopplerservice.(*Finder).run"}
{"timestamp":1478789343.457574844,"process_id":7992,"source":"metron","log_level":"info","message":"Listening on 127.0.0.1:3457","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/networkreader/network_reader.go","line":33,"method":"metron/networkreader.New"}
{"timestamp":1478789343.457743645,"process_id":7992,"source":"metron","log_level":"info","message":"metron started","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":94,"method":"main.main"}
{"timestamp":1478789343.457979679,"process_id":7992,"source":"metron","log_level":"error","message":"Finder: Watch sent an error","data":{"error":"store request timed out"},"file":"/var/vcap/data/compile/metron_agent/loggregator/src/doppler/dopplerservice/finder.go","line":201,"method":"doppler/dopplerservice.(*Finder).run"}

Then this message repeats constantly until the agent is restarted:

{"timestamp":1478789463.198026419,"process_id":7992,"source":"metron","log_level":"warn","message":"EventMarshaller: Write called while byteWriter is nil","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/writers/eventmarshaller/event_marshaller.go","line":59,"method":"metron/writers/eventmarshaller.(*EventMarshaller).Write"}

metron_agent_logrotate_cron.log:

2016/11/10 14:51:22 RuntimeStats: failed to emit: EventWriter: No envelope writer set (see SetWriter)
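
For anyone following along, here is a rough sketch of how to pull these logs off a cell; the BOSH job name and index are placeholders for whatever your manifest uses:

# v1 BOSH CLI syntax; substitute your cell job name and index
bosh ssh diego_cell_z1 0

# on the cell: check what monit thinks of the job
sudo /var/vcap/bosh/bin/monit summary

# tail the agent's logs
sudo tail -f /var/vcap/sys/log/metron_agent/metron_agent.stdout.log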

After restarting metron_agent via monit on the VM (no other changes, no re-deployment):

metron_agent.stdout.log:

{"timestamp":1478789482.261008739,"process_id":8853,"source":"metron","log_level":"info","message":"Startup: Setting up the Metron agent","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":77,"method":"main.main"}
{"timestamp":1478789482.713376284,"process_id":8853,"source":"metron","log_level":"info","message":"Listening on 127.0.0.1:3457","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/networkreader/network_reader.go","line":33,"method":"metron/networkreader.New"}
{"timestamp":1478789482.713675261,"process_id":8853,"source":"metron","log_level":"info","message":"metron started","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":94,"method":"main.main"}
{"timestamp":1478789482.714495897,"process_id":8853,"source":"metron","log_level":"info","message":"Chose protocol udp from last etcd event, updating writer...","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/main.go","line":211,"method":"main.initializeDopplerPool.func1"}

No other log changes

At this point I don't know why this is happening. I tried this on two separate deployments in two separate environments and can confirm the issue in both. Once metron_agent is restarted it works as intended, and it survives future restarts.
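
For anyone else hitting this, the workaround is just that one-off restart, roughly as sketched here (run on each affected cell; the monit binary is at the standard BOSH path):

# restart only the agent on the affected cell
sudo /var/vcap/bosh/bin/monit restart metron_agent

# confirm it comes back up and stays up
sudo /var/vcap/bosh/bin/monit summary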

For reference, here is a sample of the deployment manifest global properties for the metron_agents:

loggregator:
  etcd:
    ca_cert: |
      -----BEGIN CERTIFICATE-----
      <info removed/validated against running etcd in CF>
      -----END CERTIFICATE-----
    machines:
    - cf-etcd.service.cf.internal
    require_ssl: true
  port: 4443
  tls:
    ca_cert: |
      -----BEGIN CERTIFICATE-----
      <info removed/validated against running etcd in CF>
      -----END CERTIFICATE-----
loggregator_endpoint:
  shared_secret: <info removed/validated against CF>
metron_agent:
  deployment: <deployment name>
  etcd:
    client_cert: |
      -----BEGIN CERTIFICATE-----
      <info removed/validated against running etcd in CF>
      -----END CERTIFICATE-----
    client_key: |
      -----BEGIN RSA PRIVATE KEY-----
      <info removed/validated against running etcd in CF>
      -----END RSA PRIVATE KEY-----
  tls:
    client_cert: |
      -----BEGIN CERTIFICATE-----
      <info removed/validated against running etcd in CF>
      -----END CERTIFICATE-----
    client_key: |
      -----BEGIN RSA PRIVATE KEY-----
      <info removed/validated against running etcd in CF>
      -----END RSA PRIVATE KEY-----
metron_endpoint:
  shared_secret: <info removed/validated against CF>
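
As an aside, one way to sanity-check the etcd TLS material above from a cell is a plain openssl s_client handshake. The hostname comes from the machines list above; the 4001 client port and the local file names are assumptions and may need adjusting:

openssl s_client -connect cf-etcd.service.cf.internal:4001 \
  -CAfile etcd-ca.crt -cert etcd-client.crt -key etcd-client.key </dev/null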

Thanks for your help.

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/134082729

The labels on this github issue will be updated when the story is started.

hev commented 7 years ago

@ChrisMcGowan this was likely due to a bad state in etcd. We have plans to improve our service discovery and key-value store, which should address this in a future release.
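
For anyone debugging a similar state: the Doppler announcements that Metron's finder watches live in etcd, so inspecting those keys directly can show whether the store is healthy. A rough sketch against the etcd v2 keys API follows; the /doppler/meta and /healthstatus/doppler key paths and the 4001 client port are assumptions for this era of the release, and the cert file names are placeholders:

curl --cacert etcd-ca.crt --cert etcd-client.crt --key etcd-client.key \
  "https://cf-etcd.service.cf.internal:4001/v2/keys/doppler/meta?recursive=true"

# legacy announcements, if any are present
curl --cacert etcd-ca.crt --cert etcd-client.crt --key etcd-client.key \
  "https://cf-etcd.service.cf.internal:4001/v2/keys/healthstatus/doppler?recursive=true"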

geofffranks commented 7 years ago

@ahevenor Were the service discovery / KV store improvements implemented? Do you happen to know which version of CF/PCF this would be resolved in?

hev commented 7 years ago

@geofffranks - yes, there are a few fixes for this now:

- Loggregator 72 removes the etcd dependency for Metron. It is packaged in cf-release v251 and in PCF 1.10.
- We also tracked down a fix for this and patched Loggregator 70.2, so operators running an older version of the runtime can update problematic Metrons.

geofffranks commented 7 years ago

Thank you!
