elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

fleet-server "failed to fetch elasticsearch version" - ECK install on OpenShift isn't working #8111

Open manas-suleman opened 1 week ago

manas-suleman commented 1 week ago

Elasticsearch Version

Version: 8.15.2, Build: docker/98adf7bf6bb69b66ab95b761c9e5aadb0bb059a3/2024-09-19T10:06:03.564235954Z, JVM: 22.0.1

Installed Plugins

No response

Java Version

bundled

OS Version

OpenShift BareMetal

Problem Description

I have deployed ECK on OpenShift bare-metal servers for a POC. While I can reach the Kibana dashboard, I cannot get fleet-server to start and work. I'm using the default configuration (from these docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-openshift-deploy-the-operator.html and https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elastic-agent-fleet-quickstart.html) for the most part, with small modifications where needed.

These are my manifests:

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-sample
spec:
  version: 8.15.2
  count: 1
  elasticsearchRef:
    name: "elasticsearch-sample"
  podTemplate:
    spec:
      containers:
      - name: kibana
        resources:
          limits:
            memory: 1Gi
            cpu: 1
  config:
    server.publicBaseUrl: "https://#######"
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-sample-es-http.elastic.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-sample-agent-http.elastic.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
      - name: apm
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: elastic
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
        - name: fleet_server-1
          id: fleet_server-1
          package:
            name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: elastic
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 8.15.2
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
        index.store.type: niofs # https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
---
apiVersion: apm.k8s.elastic.co/v1
kind: ApmServer
metadata:
  name: apm-server-sample
spec:
  version: 8.15.2
  count: 1
  elasticsearchRef:
    name: "elasticsearch-sample"
  kibanaRef: 
    name: kibana-sample
  podTemplate:
    spec:
      serviceAccountName: apm-server

Agent state: oc get agents

NAME                   HEALTH   AVAILABLE   EXPECTED   VERSION   AGE
elastic-agent-sample   green    3           3          8.15.2    138m
fleet-server-sample    red                  1          8.15.2    138m

oc describe agent fleet-server-sample

Name:         fleet-server-sample
Namespace:    elastic
Labels:       <none>
Annotations:  ###
API Version:  agent.k8s.elastic.co/v1alpha1
Kind:         Agent
Metadata: ###
Spec:
  Deployment:
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
      Spec:
        Automount Service Account Token:  true
        Containers:                       <nil>
        Security Context:
          Run As User:         0
        Service Account Name:  elastic-agent
        Volumes:
          Name:  agent-data
          Persistent Volume Claim:
            Claim Name:  fleet-server-sample
    Replicas:            1
    Strategy:
  Elasticsearch Refs:
    Name:                elasticsearch-sample
  Fleet Server Enabled:  true
  Fleet Server Ref:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
  Kibana Ref:
    Name:     kibana-sample
  Mode:       fleet
  Policy ID:  eck-fleet-server
  Version:    8.15.2
Status:
  Elasticsearch Associations Status:
    elastic/elasticsearch-sample:  Established
  Expected Nodes:                  1
  Health:                          red
  Kibana Association Status:       Established
  Observed Generation:             2
  Version:                         8.15.2
Events:
  Type     Reason                   Age                   From                                 Message
  ----     ------                   ----                  ----                                 -------
  Warning  AssociationError         138m (x5 over 138m)   agent-controller                     Association backend for elasticsearch is not configured
  Warning  AssociationError         138m (x9 over 138m)   agent-controller                     Association backend for kibana is not configured
  Normal   AssociationStatusChange  138m                  agent-es-association-controller      Association status changed from [] to [elastic/elasticsearch-sample: Established]
  Normal   AssociationStatusChange  138m                  agent-kibana-association-controller  Association status changed from [] to [Established]
  Warning  Delayed                  138m (x11 over 138m)  agent-controller                     Delaying deployment of Elastic Agent in Fleet Mode as Kibana is not available yet

fleet-server pod error logs (the pod is in CrashLoopBackOff):

{"log.level":"error","@timestamp":"2024-10-14T16:35:35.550Z","message":"failed to fetch elasticsearch version","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"@timestamp":"2024-10-14T16:35:35.55Z","ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","error.message":"dial tcp [::1]:9200: connect: connection refused","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-10-14T16:35:35.551Z","message":"Failed Elasticsearch output configuration test, using bootstrap values.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","error.message":"dial tcp [::1]:9200: connect: connection refused","output":{"hosts":["localhost:9200"],"protocol":"https","proxy_disable":false,"proxy_headers":{},"service_token":"#####","ssl":{"certificate_authorities":["/mnt/elastic-internal/elasticsearch-association/elastic/elasticsearch-sample/certs/ca.crt"],"verification_mode":"full"},"type":"elasticsearch"},"@timestamp":"2024-10-14T16:35:35.55Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:35.612Z","message":"panic: runtime error: invalid memory address or nil pointer dereference","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x55df2cba3217]","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"goroutine 279 [running]:","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).configFromUnits(0xc000002240, {0x55df2d489218, 0xc000486370})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:441 +0x97","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).start(0xc000002240, {0x55df2d489218, 0xc000486370})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:344 +0x51","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).reconfigure(0xc0002fd728?, {0x55df2d489218?, 0xc000486370?})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:387 +0x8d","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.013Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).Run.func5()","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.013Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:204 +0x5c5","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.148Z","message":"created by github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).Run in goroutine 1","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.148Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:162 +0x416","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.515Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":647},"message":"Component state changed fleet-server-default (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.515Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":665},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.516Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":665},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:45.612Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.logReturn","file.name":"cmd/run.go","file.line":162},"message":"2 errors occurred:\n\t* timeout while waiting for managers to shut down: no response from runtime manager, no response from vars manager\n\t* config manager: failed to initialize Fleet Server: context deadline exceeded\n\n","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
Error: 2 errors occurred:
    * timeout while waiting for managers to shut down: no response from runtime manager, no response from vars manager
    * config manager: failed to initialize Fleet Server: context deadline exceeded

From the logs it appears that the fleet-server pod is looking for the Elasticsearch cluster at localhost instead of sending requests to the Elasticsearch service. There are other errors as well, but I think this one needs to be resolved first.
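
As a quick sanity check of that theory (a rough sketch; the agent pod name is a placeholder and I'm assuming curl is present in the elastic-agent image), the Elasticsearch service should be reachable from inside the fleet-server pod with something like:

# hit the ES HTTP service from inside the agent pod, using the CA that ECK mounts for the association
oc exec -n elastic <fleet-server-sample-agent-pod> -- \
  curl -s --cacert /mnt/elastic-internal/elasticsearch-association/elastic/elasticsearch-sample/certs/ca.crt \
  https://elasticsearch-sample-es-http.elastic.svc:9200

Even without credentials this should return a 401 from Elasticsearch rather than a connection error if the service is reachable.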

Errors in kibana pod:

[2024-10-14T16:17:47.714+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. Request timed out

Steps to Reproduce

Deploy an ECK cluster using the manifests above, which are mostly the defaults with some changes.

Logs (if relevant)

No response

gwbrown commented 6 days ago

I think this issue is more appropriate for the ECK repo rather than the Elasticsearch repo, so I'll move this there. Let me know if there's an underlying issue with Elasticsearch here.

barkbay commented 4 days ago

I didn't manage to reproduce the problem using the provided Elasticsearch and Kibana manifests and the following versions:

I would first try to understand the connectivity issue between Kibana and Elasticsearch. Could you check that the ES cluster is healthy, that all the Pods are running and ready, and whether there is anything suspicious in the ES logs?
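
Something along these lines (a sketch, assuming the standard ECK secret and pod names for your cluster) should show the pod state and the health Elasticsearch reports about itself:

# password of the built-in elastic user, stored by ECK in <cluster-name>-es-elastic-user
PASSWORD=$(oc get secret -n elastic elasticsearch-sample-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# pods and their readiness
oc get pods -n elastic
# cluster health as reported by Elasticsearch
oc port-forward -n elastic service/elasticsearch-sample-es-http 9200 &
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"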

FWIW here is the Agent definition I've been using for my test:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-sample
  namespace: elastic
spec:
  version: 8.15.2
  kibanaRef:
    name: kibana-sample
  elasticsearchRefs:
    - name: elasticsearch-sample
  mode: fleet
  fleetServerEnabled: true
  policyID: eck-fleet-server
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
        containers:
          - name: agent
            securityContext:
              privileged: true

With the following command to add the service account to the privileged SCC:

oc adm policy add-scc-to-user privileged -z fleet-server -n elastic

manas-suleman commented 17 hours ago

Hi @barkbay,

Thanks for your reply. I was basically using the same fleet-server definition, except for the privileged security context. I've now tried with privileged: true and unfortunately the issue isn't resolved. I've done some more troubleshooting and found some more potential clues as to what the problem could be:

oc get kibana

NAME            HEALTH   NODES   VERSION   AGE
kibana-sample   green    1       8.15.2    6d20h

kibana pod error logs:

[2024-10-20T14:20:20.005+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. Request timed out
[2024-10-20T14:42:46.570+00:00][ERROR][plugins.security.authentication] Authentication attempt failed:
{
  "error": {
    "root_cause": [
      {
        "type": "security_exception",
        "reason": "unable to authenticate user [elastic-fleet-server-sample-agent-kb-user] for REST request [/_security/_authenticate]",
        "header": {
          "WWW-Authenticate": [
            "Basic realm=\"security\", charset=\"UTF-8\"",
            "Bearer realm=\"security\"",
            "ApiKey"
          ]
        }
      }
    ],
    "type": "security_exception",
    "reason": "unable to authenticate user [elastic-fleet-server-sample-agent-kb-user] for REST request [/_security/_authenticate]",
    "header": {
      "WWW-Authenticate": [
        "Basic realm=\"security\", charset=\"UTF-8\"",
        "Bearer realm=\"security\"",
        "ApiKey"
      ]
    }
  },
  "status": 401
}

Either of these two logs seems to be the root cause of this error. Is it possible that the first log is causing the auth failure? To find the cause of the first log, it looks like something is wrong with Elasticsearch, so I tried troubleshooting that.

oc get elasticsearch

NAME                   HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch-sample   yellow   1       8.15.2    Ready   6d21h

oc describe elasticsearch

...
Status:
  Available Nodes:  1
  Conditions:
    Last Transition Time:  2024-10-19T21:39:52Z
    Status:                True
    Type:                  ReconciliationComplete
    Last Transition Time:  2024-10-14T14:10:27Z
    Message:               All nodes are running version 8.15.2
    Status:                True
    Type:                  RunningDesiredVersion
    Last Transition Time:  2024-10-19T21:39:52Z
    Message:               Service elastic/elasticsearch-sample-es-internal-http has endpoints
    Status:                True
    Type:                  ElasticsearchIsReachable
    Last Transition Time:  2024-10-14T16:08:39Z
    Message:               Cannot get compute and storage resources from Elasticsearch resource generation 3: cannot compute resources for NodeSet "default": no CPU request or limit set
    Status:                False
    Type:                  ResourcesAwareManagement
  Health:                  yellow
Events:                              <none>
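
For reference, that ResourcesAwareManagement condition just reflects that the nodeSet has no explicit CPU request or limit. A rough sketch of what setting them would look like (elasticsearch is the default container name used by ECK; the values are only examples):

  nodeSets:
    - name: default
      count: 1
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests:
                  cpu: 2
                  memory: 4Gi
                limits:
                  memory: 4Gi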

elasticsearch error logs:

{
  "@timestamp": "2024-10-19T21:39:48.160Z",
  "log.level": "ERROR",
  "message": "exception during geoip databases update",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[elasticsearch-sample-es-default-0][generic][T#4]",
  "log.logger": "org.elasticsearch.ingest.geoip.GeoIpDownloader",
  "elasticsearch.cluster.uuid": "d_KKluHQS2Ohfrx6aJn1SA",
  "elasticsearch.node.id": "hUz6RUtXTmep87LFE_FNkQ",
  "elasticsearch.node.name": "elasticsearch-sample-es-default-0",
  "elasticsearch.cluster.name": "elasticsearch-sample",
  "error.type": "org.elasticsearch.ElasticsearchException",
  "error.message": "not all primary shards of [.geoip_databases] index are active",
  "error.stack_trace": "org.elasticsearch.ElasticsearchException: not all primary shards of [.geoip_databases] index are active\n\tat org.elasticsearch.ingest.geoip@8.15.2/org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:141)\n\tat org.elasticsearch.ingest.geoip@8.15.2/org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:293)\n\tat org.elasticsearch.ingest.geoip@8.15.2/org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:162)\n\tat org.elasticsearch.ingest.geoip@8.15.2/org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:61)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:34)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1570)\n"
}
{
  "@timestamp": "2024-10-19T21:39:50.271Z",
  "log.level": "WARN",
  "message": "path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}, status: 503",
  "ecs.version": "1.2.0",
  "service.name": "ES_ECS",
  "event.dataset": "elasticsearch.server",
  "process.thread.name": "elasticsearch[elasticsearch-sample-es-default-0][system_read][T#1]",
  "log.logger": "rest.suppressed",
  "trace.id": "c819660c9f92837d8de985ed7fb51b84",
  "elasticsearch.cluster.uuid": "d_KKluHQS2Ohfrx6aJn1SA",
  "elasticsearch.node.id": "hUz6RUtXTmep87LFE_FNkQ",
  "elasticsearch.node.name": "elasticsearch-sample-es-default-0",
  "elasticsearch.cluster.name": "elasticsearch-sample",
  "error.type": "org.elasticsearch.action.search.SearchPhaseExecutionException",
  "error.message": "all shards failed",
  "error.stack_trace": "Failed to execute phase [query], all shards failed; shardFailures {[hUz6RUtXTmep87LFE_FNkQ][.kibana_task_manager_8.15.2_001][0]: org.elasticsearch.action.NoShardAvailableActionException: [elasticsearch-sample-es-default-0][20.128.0.208:9300][indices:data/read/search[phase/query]]\n}\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:724)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:416)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:756)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:509)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:337)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:53)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:677)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleException(TransportService.java:1766)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1490)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1624)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1599)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:44)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:44)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:151)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:967)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1570)\nCaused by: org.elasticsearch.action.NoShardAvailableActionException: [elasticsearch-sample-es-default-0][20.128.0.208:9300][indices:data/read/search[phase/query]]\n\tat 
org.elasticsearch.server@8.15.2/org.elasticsearch.action.NoShardAvailableActionException.forOnShardFailureWrapper(NoShardAvailableActionException.java:28)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:544)\n\tat org.elasticsearch.server@8.15.2/org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:491)\n\t... 18 more\n"
}

For the first log, according to the post https://discuss.elastic.co/t/not-all-primary-shards-of-geoip-databases-index-are-active/324401, this issue should go away on its own, but it doesn't. Other posts suggest it may be caused by limited resources on the servers, but the OpenShift servers have plenty of available CPU and memory.
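
To dig into the yellow health a bit more, something like this (a rough sketch, assuming the standard ECK elastic-user secret and a port-forward to the ES service) should show which shards are unassigned and why:

PASSWORD=$(oc get secret -n elastic elasticsearch-sample-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
oc port-forward -n elastic service/elasticsearch-sample-es-http 9200 &
# shards that are not STARTED, with the unassigned reason
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
# detailed explanation for the first unassigned shard Elasticsearch finds
curl -k -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/allocation/explain?pretty"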

Finally, the latest fleet-server error logs:

{
  "log.level": "error",
  "@timestamp": "2024-10-21T14:15:27.979Z",
  "message": "failed to fetch elasticsearch version",
  "component": {
    "binary": "fleet-server",
    "dataset": "elastic_agent.fleet_server",
    "id": "fleet-server-default",
    "type": "fleet-server"
  },
  "log": {
    "source": "fleet-server-default"
  },
  "service.type": "fleet-server",
  "error.message": "dial tcp [::1]:9200: connect: connection refused",
  "ecs.version": "1.6.0",
  "service.name": "fleet-server"
}
{
  "log.level": "error",
  "@timestamp": "2024-10-21T14:15:28.96Z",
  "message": "failed to fetch elasticsearch version",
  "component": {
    "binary": "fleet-server",
    "dataset": "elastic_agent.fleet_server",
    "id": "fleet-server-default",
    "type": "fleet-server"
  },
  "log": {
    "source": "fleet-server-default"
  },
  "ecs.version": "1.6.0",
  "service.name": "fleet-server",
  "service.type": "fleet-server",
  "error.message": "dial tcp <load-balancer-IP>:9200: connect: no route to host"
}
{
  "log.level": "error",
  "@timestamp": "2024-10-21T14:15:28.962Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents",
    "file.name": "coordinator/coordinator.go",
    "file.line": 665
  },
  "message": "Unit state changed fleet-server-default-fleet-server (STARTING->FAILED): Error - failed version compatibility check with elasticsearch: dial tcp <load-balancer-IP>:9200: connect: no route to host",
  "log": {
    "source": "elastic-agent"
  },
  "component": {
    "id": "fleet-server-default",
    "state": "HEALTHY"
  },
  "unit": {
    "id": "fleet-server-default-fleet-server",
    "type": "input",
    "state": "FAILED",
    "old_state": "STARTING"
  },
  "ecs.version": "1.6.0"
}
{
  "log.level": "error",
  "@timestamp": "2024-10-21T14:17:09.078Z",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent/internal/pkg/agent/cmd.logReturn",
    "file.name": "cmd/run.go",
    "file.line": 162
  },
  "message": "1 error occurred:\n\t* timeout while waiting for managers to shut down: no response from vars manager\n\n",
  "log": {
    "source": "elastic-agent"
  },
  "ecs.version": "1.6.0"
}
Error: 1 error occurred:
    * timeout while waiting for managers to shut down: no response from vars manager

Many other fleet-server errors appear to have gone, but "failed to fetch elasticsearch version" and similar errors are persistent, and the fleet-server pod is still in CrashLoopBackOff. The Elasticsearch pod is running and the Elasticsearch service exists, so I'm not sure why fleet-server cannot reach the Elasticsearch cluster. This feels more like an ECK issue than an OpenShift/infrastructure issue, as I don't have this problem with any other apps. Can you please nudge me toward what the actual problem could be here?
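
For completeness, a rough sketch of the checks that should show whether the service, its endpoints and DNS look right from inside the cluster (the agent pod name is a placeholder, and I'm assuming getent is present in the elastic-agent image):

# the ES HTTP service and its endpoints
oc get svc,endpoints -n elastic elasticsearch-sample-es-http
# DNS resolution of the service name from inside the fleet-server pod
oc exec -n elastic <fleet-server-sample-agent-pod> -- getent hosts elasticsearch-sample-es-http.elastic.svc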