elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Fleet Server unhealthy on pr cloud deployment #575

Open juliaElastic opened 2 years ago

juliaElastic commented 2 years ago

Created a cloud deployment from this PR: https://github.com/elastic/kibana/pull/134565
Cloud link: https://kibana-pr-134565.kb.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents (username "elastic", password "8IViQT7Ol1Ki5J1ABFyda1hN")

Fleet Server shows up as unhealthy. Agent/Kibana version: 8.4.0 (main)

For comparison, I checked another Kibana PR deployment where Fleet Server came up healthy.

Cloud Admin URL: https://admin.found.no/deployments/efa47ae977e6d61437a23065eec13880

Agent Logs:

https://kibana-ops-buildkite-monitoring.kb.us-central1.gcp.cloud.es.io:9243/app/logs/link-to/host-logs/kb-n2-2-047bf3bf51b94a40?time=1655466891264

Seeing this in the logs; it might be related:

Jun 17, 2022 @ 14:01:11.000 time="2022-06-17T12:01:11.550880609Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc

Integration Server Logs: https://logging.us-west2.gcp.elastic-cloud.com/app/r/s/rancid-raspy-iron

ph commented 2 years ago

@michalpristas could you investigate how serious this is, so we can prioritize it?

juliaElastic commented 2 years ago

I noticed something strange in the package policies. There is a managed policy, Elastic Cloud agent policy, which uses the Elastic Cloud internal output. But when I query the package policies from the .kibana index, I see that the APM and Fleet Server package policies use fleet-default-output, which is not the output defined in the agent policy. Maybe there is a bug in the preconfiguration code in Fleet?

.kibana/_search?q=type:ingest-package-policies

     "_id": "ingest-package-policies:elastic-cloud-fleet-server",
        "_score": 5.1603312,
        "_source": {
          "ingest-package-policies": {
            "name": "Fleet Server",
            "namespace": "default",
            "package": {
              "name": "fleet_server",
              "title": "Fleet Server",
              "version": "1.2.0"
            },
            "enabled": true,
            "policy_id": "policy-elastic-agent-on-cloud",
            "output_id": "fleet-default-output",
          }
}

.kibana/_search?q=type:ingest-outputs

          "ingest-outputs": {
            "name": "Elastic Cloud internal output",
            "type": "elasticsearch",
            "hosts": [
              "http://89a580b8af164e9d94dee28aed08d8b5.containerhost:9244"
            ],
            "is_default": false,
            "is_default_monitoring": false,
            "is_preconfigured": true,
            "output_id": "es-containerhost"
          },
          "type": "ingest-outputs",
          "references": [],
          "migrationVersion": {
            "ingest-outputs": "8.0.0"
          },
          "coreMigrationVersion": "8.4.0",
          "updated_at": "2022-06-17T18:16:11.538Z"
        }
      },
      {
        "_index": ".kibana_8.4.0_001",
        "_id": "ingest-outputs:a09a5397-7b9a-5a73-a622-e29f4c635658",
        "_score": 5.7339883,
        "_source": {
          "ingest-outputs": {
            "name": "default",
            "is_default": true,
            "is_default_monitoring": true,
            "type": "elasticsearch",
            "hosts": [
              "https://89a580b8af164e9d94dee28aed08d8b5.us-west2.gcp.elastic-cloud.com:443"
            ],
            "output_id": "fleet-default-output"

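As a quick sketch of how to cross-check this, the agent policy saved objects can be queried the same way to see which output the managed policy itself references (the saved-object type name and the placeholder host/credentials below are assumptions, not verified on this deployment):

    # Sketch: list the Fleet agent policy saved objects and check which output
    # "policy-elastic-agent-on-cloud" references, to compare against the package
    # policies' "output_id" shown above. Host and credentials are placeholders.
    curl -s -u elastic:$ES_PASSWORD \
      "https://<es-host>:9243/.kibana/_search?q=type:ingest-agent-policies&pretty"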
I could reproduce this locally:

[screenshot]

EDIT: I checked an 8.2.3 cloud staging instance, and the package policies and outputs have the same setup as in 8.4, yet Fleet Server comes up healthy there. Maybe something changed on the agent / Fleet Server side in where the output host is taken from?
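(Hedged sketch: another way to see which output host the agent actually receives is to fetch the compiled policy through Kibana's Fleet API; the Kibana host and credentials below are placeholders.)

    # Sketch: fetch the full compiled agent policy for the cloud policy and look at
    # the "outputs" section that Fleet hands to the agent.
    curl -s -u elastic:$KIBANA_PASSWORD \
      "https://<kibana-host>:9243/api/fleet/agent_policies/policy-elastic-agent-on-cloud/full"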

juliaElastic commented 2 years ago

I think this state of the outputs is not the real problem; on other PR cloud deployments the same setup works fine with a healthy Fleet Server: https://github.com/elastic/kibana/pull/131322

Only my two PRs have this issue: https://github.com/elastic/kibana/pull/134673 and https://github.com/elastic/kibana/pull/134565

michalpristas commented 2 years ago

Could you extract the logs for me? I still don't have access to read logs in cloud for some reason; I've already raised it with IT.

juliaElastic commented 2 years ago


@michalpristas Here are the Agent logs from the last day: elastic_agent_and_apm_logs_134565.csv

I think one reason this instance stopped is that it ran out of memory: the cloud-ci Integration Server has 512 MB of RAM, and I tried to enroll 10k agents.

However, today I tried to start a new ESS cluster with oblt-cli, and Fleet Server does not start up at all on 8.4.0-SNAPSHOT. Agent logs of that instance: elastic_agent_logs_ess-sarxy-custom.csv
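(If shell access to the Integration Server container is available, a minimal sketch for checking the agent's own view of its health and configuration, assuming the standard elastic-agent CLI:)

    # Sketch: check component health and dump the rendered configuration
    # (including the Elasticsearch output hosts) from inside the container.
    elastic-agent status
    elastic-agent inspect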

This might be the same issue: https://github.com/elastic/fleet-server/issues/1574