GoogleCloudPlatform / ops-agent

Apache License 2.0

Not feasible to use Vagrant locally to test integration with ops-agent, as the network check does not respect the GCE_METADATA_HOST environment variable #1630

Open MikeMoore63 opened 9 months ago

MikeMoore63 commented 9 months ago

NOTE: To get the best support experience for bug fixes, please go to https://cloud.google.com/support-hub and follow the instructions. In comparison, Bug reports filed in this repo only have best effort support, and do not have guaranteed response / resolution SLOs

Describe the bug

To test locally, I normally set the GCE_METADATA_HOST environment variable. The Ops Agent network health check, however, hard-codes the metadata URL (https://github.com/GoogleCloudPlatform/ops-agent/blob/master/internal/healthchecks/network_check.go#L68), so it cannot easily be tested against local metadata server implementations. I am trying to test an app locally that exposes JMX, and I want to run the Ops Agent alongside it to see the integration while debugging the app. This failed because the ops-agent does not respect GCE_METADATA_HOST. google-cloud-ops-agent-fluent-bit.service does respect it, as does google-cloud-ops-agent-diagnostics.service, so the issue appears to be only in the health check: it does not use the metadata library to validate whether it is running on GCE, and that library respects this environment variable.

To Reproduce Steps to reproduce the behavior:

  1. Start a VM locally using Vagrant with any image supported by the Ops Agent.
  2. Install the Ops Agent (any version, any installation steps).
  3. Start up a test metadata server, such as https://github.com/salrashid123/gce_metadata_server, so that custom metrics (for example Java JMX) can be tested and debugged off GCP.
  4. Customize the systemd config with:
    [Manager]
    DefaultEnvironment=GCE_METADATA_HOST=localhost:8080
  5. Restart the agent and see the network health check fail. The failure is only with:

[Network Check] Result: ERROR, Detail: Get "http://metadata.google.internal": dial tcp 169.254.169.254:80:

All other programs and agents, such as Fluent Bit, work fine. Only the ops-agent fails.

  6. See error:

[Network Check] Result: ERROR, Detail: Get "http://metadata.google.internal": dial tcp 169.254.169.254:80:

Expected behavior

The Ops Agent should start without error.

braydonk commented 8 months ago

Hi @MikeMoore63,

I think that health check is non-blocking. The Ops Agent service should still start regardless of that health check failing. We should be able to make the health checks aware of that environment variable in the future, but it shouldn't block what you're trying to do currently as the Ops Agent should still start correctly.

It's worth noting that we currently only support use cases on a GCE VM, so anything Ops Agent related like the health checks actively make that assumption. We don't make any claims that it will work in any other environment, so there could be other things that break.

github-actions[bot] commented 1 week ago

This issue was marked stale due to lack of activity. It will be closed in 14 days.