elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Fleet Server unhealthy on pr cloud deployment #575

Open juliaElastic opened 2 years ago

juliaElastic commented 2 years ago

Created a cloud deployment from this PR: https://github.com/elastic/kibana/pull/134565
Cloud link: https://kibana-pr-134565.kb.us-west2.gcp.elastic-cloud.com:9243/app/fleet/agents (username "elastic", password "8IViQT7Ol1Ki5J1ABFyda1hN")

Fleet Server shows up as unhealthy. Agent/Kibana version: 8.4.0 (main)

For comparison, I checked another Kibana PR deployment where Fleet Server came up healthy.

Cloud Admin URL: https://admin.found.no/deployments/efa47ae977e6d61437a23065eec13880

Agent Logs:

https://kibana-ops-buildkite-monitoring.kb.us-central1.gcp.cloud.es.io:9243/app/logs/link-to/host-logs/kb-n2-2-047bf3bf51b94a40?time=1655466891264

Seeing this in the logs; it might be related:

Jun 17, 2022 @ 14:01:11.000 time="2022-06-17T12:01:11.550880609Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {localhost  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing only one connection allowed\". Reconnecting..." module=grpc

Integration Server Logs: https://logging.us-west2.gcp.elastic-cloud.com/app/r/s/rancid-raspy-iron

ph commented 2 years ago

@michalpristas could you investigate how serious this is, so we can prioritize it?

juliaElastic commented 2 years ago

I noticed something strange in the package policies. There is a managed policy, Elastic Cloud agent policy, which uses the Elastic Cloud internal output. But when I query the package policies from the .kibana index, I see that the APM and Fleet Server package policies use fleet-default-output, which is not the output defined in the agent policy. Maybe there is a bug in the preconfiguration code in Fleet?

.kibana/_search?q=type:ingest-package-policies

     "_id": "ingest-package-policies:elastic-cloud-fleet-server",
        "_score": 5.1603312,
        "_source": {
          "ingest-package-policies": {
            "name": "Fleet Server",
            "namespace": "default",
            "package": {
              "name": "fleet_server",
              "title": "Fleet Server",
              "version": "1.2.0"
            },
            "enabled": true,
            "policy_id": "policy-elastic-agent-on-cloud",
            "output_id": "fleet-default-output",
          }
}

.kibana/_search?q=type:ingest-outputs

          "ingest-outputs": {
            "name": "Elastic Cloud internal output",
            "type": "elasticsearch",
            "hosts": [
              "http://89a580b8af164e9d94dee28aed08d8b5.containerhost:9244"
            ],
            "is_default": false,
            "is_default_monitoring": false,
            "is_preconfigured": true,
            "output_id": "es-containerhost"
          },
          "type": "ingest-outputs",
          "references": [],
          "migrationVersion": {
            "ingest-outputs": "8.0.0"
          },
          "coreMigrationVersion": "8.4.0",
          "updated_at": "2022-06-17T18:16:11.538Z"
        }
      },
      {
        "_index": ".kibana_8.4.0_001",
        "_id": "ingest-outputs:a09a5397-7b9a-5a73-a622-e29f4c635658",
        "_score": 5.7339883,
        "_source": {
          "ingest-outputs": {
            "name": "default",
            "is_default": true,
            "is_default_monitoring": true,
            "type": "elasticsearch",
            "hosts": [
              "https://89a580b8af164e9d94dee28aed08d8b5.us-west2.gcp.elastic-cloud.com:443"
            ],
            "output_id": "fleet-default-output"

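As a quick sketch of how to cross-check this, the agent policy saved objects can be queried the same way to see which output the managed policy itself references (the saved-object type name and the placeholder host/credentials below are assumptions, not verified on this deployment):

    # Sketch: list the Fleet agent policy saved objects and check which output
    # "policy-elastic-agent-on-cloud" references, to compare against the package
    # policies' "output_id" shown above. Host and credentials are placeholders.
    curl -s -u elastic:$ES_PASSWORD \
      "https://<es-host>:9243/.kibana/_search?q=type:ingest-agent-policies&pretty"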
I could reproduce this locally:

[screenshot]

EDIT: I checked an 8.2.3 cloud staging instance, and the package policies and outputs have the same setup as in 8.4, yet Fleet Server comes up healthy there. Maybe something changed on the agent / Fleet Server side in where the output host is taken from?
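(Hedged sketch: another way to see which output host the agent actually receives is to fetch the compiled policy through Kibana's Fleet API; the Kibana host and credentials below are placeholders.)

    # Sketch: fetch the full compiled agent policy for the cloud policy and look at
    # the "outputs" section that Fleet hands to the agent.
    curl -s -u elastic:$KIBANA_PASSWORD \
      "https://<kibana-host>:9243/api/fleet/agent_policies/policy-elastic-agent-on-cloud/full"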

juliaElastic commented 2 years ago

I think this state of the outputs is not the real problem; on other PR cloud deployments the same setup works fine with a healthy Fleet Server: https://github.com/elastic/kibana/pull/131322

Only my two PRs have this issue: https://github.com/elastic/kibana/pull/134673 and https://github.com/elastic/kibana/pull/134565

michalpristas commented 2 years ago

Could you extract the logs for me? I still don't have access to read logs in cloud for some reason; I've already raised it with IT.

juliaElastic commented 2 years ago


@michalpristas Here are the Agent logs from the last day: elastic_agent_and_apm_logs_134565.csv

I think one reason this instance stopped is that it ran out of memory: the cloud-ci Integration Server has 512 MB of RAM, and I tried to enroll 10k agents.

However, today I tried to start a new ESS cluster with oblt-cli, and Fleet Server does not start up at all on 8.4.0-SNAPSHOT. Agent logs of that instance: elastic_agent_logs_ess-sarxy-custom.csv
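(If shell access to the Integration Server container is available, a minimal sketch for checking the agent's own view of its health and configuration, assuming the standard elastic-agent CLI:)

    # Sketch: check component health and dump the rendered configuration
    # (including the Elasticsearch output hosts) from inside the container.
    elastic-agent status
    elastic-agent inspect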

This might be the same issue: https://github.com/elastic/fleet-server/issues/1574