cloudfoundry / diego-windows-release


Missing zone in metron agent config #20

Closed amhuber closed 7 years ago

amhuber commented 7 years ago

Metron now tries to connect to a Doppler in its zone for gRPC, but you are not writing the zone to the Metron config.json on Windows.

References:

Metron code looking for zone: https://github.com/cloudfoundry/loggregator/blob/master/src/metron/clientpool/grpc_connector.go#L42

Metron config on Linux: https://github.com/cloudfoundry/loggregator/blob/develop/jobs/metron_agent/templates/metron_agent.json.erb#L29

Error messages on Windows due to the missing zone:

2017/01/08 20:39:10 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup .doppler.service.cf.internal: getaddrinfow: No such host is known."; Reconnecting to {.doppler.service.cf.internal:8082 }

2017/01/08 20:39:10 Failed to dial .doppler.service.cf.internal:8082: context canceled; please retry.
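
The leading dot in `.doppler.service.cf.internal` is what you would expect if an empty zone were prefixed to the Doppler hostname. A minimal Go sketch of that failure mode, assuming the address has the form `<zone>.<host>:<port>`; the helper name `dopplerAddr` is hypothetical and is not taken from the Metron source:

```go
package main

import "fmt"

// dopplerAddr is a hypothetical helper, not the Metron implementation.
// It assumes the zone is joined to the Doppler hostname with a dot.
func dopplerAddr(zone, host string, port int) string {
	return fmt.Sprintf("%s.%s:%d", zone, host, port)
}

func main() {
	// With a zone set, the name has a resolvable shape.
	fmt.Println(dopplerAddr("z1", "doppler.service.cf.internal", 8082))
	// With an empty zone, the name starts with a dot, as in the errors above.
	fmt.Println(dopplerAddr("", "doppler.service.cf.internal", 8082))
}
```

With an empty zone the sketch yields the same `.doppler.service.cf.internal:8082` target that appears in the log lines.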

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/137241663

The labels on this github issue will be updated when the story is started.

sunjayBhatia commented 7 years ago

Hey @amhuber. This has actually always been the case in one way or another, and we purposefully omit the zone on Windows for ease of install and configuration in a non-bosh world. See https://github.com/cloudfoundry-attic/diego-windows-msi/commit/e2ec8f0f2344d45e3543bed0862c3e729b1e15a2.

With bosh-windows, the zone will be appropriately set and zone redundancy will work as expected. bosh-windows is about to be GA, at which point the MSI workflow will be deprecated.

amhuber commented 7 years ago

With respect, that seems rather broken. The metron agent is running in a tight loop trying to find the invalid DNS entry and reconnecting constantly. Logging may be "working" but metron is consuming a lot of CPU and thrashing the event logs with errors.

I understand the long term fix is to switch to the BOSH agent which will use a real zone instead of your fake "windows" one, but why not do that now? I've been running with the MSIs for almost a year in production using the correct zone of "z1" or "z2" as appropriate with no issues on the Windows cells. Quit using the invalid "windows" zone, put the metron zone back in the configuration, and everything will start working as it's supposed to without a hack. The alternative if you want to keep using the "windows" zone is to add a tag onto the doppler service in the CF config file called "windows" so the DNS name would resolve correctly. Either way the metron service would stop failing and spewing logs.
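
One way to surface a missing zone early, instead of through repeated dial failures, would be to validate the config when it is loaded. This is only a sketch: the JSON key `Zone`, the type `agentConfig`, and the function `parseConfig` are assumptions made for illustration and have not been checked against the Metron config.json format.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// agentConfig models only the field discussed in this thread.
// The JSON key "Zone" is an assumption, not verified against Metron.
type agentConfig struct {
	Zone string `json:"Zone"`
}

var errMissingZone = errors.New("config: zone is empty")

// parseConfig decodes the config and rejects an empty or missing zone.
func parseConfig(raw []byte) (agentConfig, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return agentConfig{}, fmt.Errorf("config: %w", err)
	}
	if strings.TrimSpace(cfg.Zone) == "" {
		return agentConfig{}, errMissingZone
	}
	return cfg, nil
}

func main() {
	for _, raw := range []string{`{"Zone":"z1"}`, `{}`} {
		cfg, err := parseConfig([]byte(raw))
		fmt.Printf("zone=%q err=%v\n", cfg.Zone, err)
	}
}
```

A check like this reports the problem once at startup; it does not replace writing the real zone ("z1", "z2") into the config.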

amhuber commented 7 years ago

Just to be clear, even though it has been this way for a while as you said, it's only a serious issue now due to the gRPC changes in metron. Before that metron didn't seem to care that the zone wasn't configured. I checked my production servers running older code and there are no errors. Using the latest release with the new metron code, metron is spewing errors continuously.

I can't go to production using the code as is, so please reconsider putting the zone back into the metron config to fix this until the BOSH agent is ready.

amhuber commented 7 years ago

@sbenario, can you take a look at this issue and see if this is potentially something that can be addressed using one of the options above?

sbenario commented 7 years ago

Sorry Aaron, I'm no longer the PM for this project (announced at the runtime PMC several months ago). @awmartin and @aminjam should be able to take a look though :-)

amhuber commented 7 years ago

Ah, sorry, I must have missed the announcement but thanks for pulling in the correct new owners.

sbenario commented 7 years ago

no problem!



awmartin commented 7 years ago

@amhuber We have the issue tracked and prioritized. However, because the team's focus is on BOSH-Windows GA, we may not get to this in a predictable timeframe ourselves. PRs are welcome, and we'll review them when they come in.