elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Quickstart guide on beats with ECK does not work #3594

Closed thepotatocannon closed 4 years ago

thepotatocannon commented 4 years ago

Bug Report

I used the quickstart deployments to test Filebeat on ECK; all the YAML files were taken from the documentation. While Kibana and Elasticsearch both work fine, Filebeat cannot connect to Elasticsearch.
The operator version is 1.2.0, the stack runs 7.8.1, and kubectl version reports:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.10", GitCommit:"f3add640dbcd4f3c33a7749f38baaac0b3fe810d", GitTreeState:"clean", BuildDate:"2020-05-20T14:00:52Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

I don't use any specific cloud provider.

Logs from Filebeat:

2020-08-06T14:02:14.400Z    WARN    [transport] transport/tcp.go:52 DNS lookup failure "quickstart-es-http.elastic-system.svc": lookup quickstart-es-http.elastic-system.svc on 10.96.0.10:53: read udp 10.10.1.50:54114->10.96.0.10:53: i/o timeout
2020-08-06T14:02:16.333Z    ERROR   [publisher_pipeline_output] pipeline/output.go:155  Failed to connect to backoff(elasticsearch(https://quickstart-es-http.elastic-system.svc:9200)): Get https://quickstart-es-http.elastic-system.svc:9200: lookup quickstart-es-http.elastic-system.svc on 10.96.0.10:53: read udp 10.10.1.50:54114->10.96.0.10:53: i/o timeout
2020-08-06T14:02:16.333Z    INFO    [publisher_pipeline_output] pipeline/output.go:146  Attempting to reconnect to backoff(elasticsearch(https://quickstart-es-http.elastic-system.svc:9200)) with 1 reconnect attempt(s)
2020-08-06T14:02:16.333Z    INFO    [publisher] pipeline/retry.go:221   retryer: send unwait signal to consumer
2020-08-06T14:02:16.333Z    INFO    [publisher] pipeline/retry.go:225     done

Any ideas what could go wrong?

barkbay commented 4 years ago

> DNS lookup failure "quickstart-es-http.elastic-system.svc": lookup quickstart-es-http.elastic-system.svc on 10.96.0.10:53: read udp 10.10.1.50:54114->10.96.0.10:53: i/o timeout

That looks like a DNS failure. Could you confirm that the Service quickstart-es-http exists in the namespace elastic-system, and check that you can resolve quickstart-es-http.elastic-system.svc to the cluster IP from a Pod in your cluster?
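
For example, something along these lines (a minimal sketch reusing the quickstart names from the logs above; the busybox tag is arbitrary):

```sh
# Confirm the Elasticsearch Service created by ECK exists
kubectl get service quickstart-es-http -n elastic-system

# Resolve the Service name from a throwaway Pod on the cluster network
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.32 -- \
  nslookup quickstart-es-http.elastic-system.svc
```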

thepotatocannon commented 4 years ago

> DNS lookup failure "quickstart-es-http.elastic-system.svc": lookup quickstart-es-http.elastic-system.svc on 10.96.0.10:53: read udp 10.10.1.50:54114->10.96.0.10:53: i/o timeout
>
> That looks like a DNS failure. Could you confirm that the Service quickstart-es-http exists in the namespace elastic-system, and check that you can resolve quickstart-es-http.elastic-system.svc to the cluster IP from a Pod in your cluster?

Sorry for the delay. The service exists in the given namespace; what's more, I created a busybox pod, ran nslookup against that service, and the addresses were successfully resolved. Is there any chance this is connected to the current Beats image or setup?

EDIT: Every other application on this cluster works perfectly fine, and Kibana can resolve the very same DNS address for Elasticsearch.

barkbay commented 4 years ago

I'm wondering if it might come from the dnsPolicy: ClusterFirstWithHostNet in the quickstart. Could you deploy the busybox Pod again with dnsPolicy set to ClusterFirstWithHostNet and see if you get the same error?

Edit: You can also try with hostNetwork: true
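
Something like this, for example (a sketch mirroring the two settings the quickstart DaemonSet uses; run it once as-is and once with the two fields removed to compare):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-test-hostnet
  namespace: elastic-system
spec:
  hostNetwork: true                    # share the node's network stack, like the quickstart DaemonSet
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS despite host networking
  restartPolicy: Never
  containers:
  - name: busybox
    image: busybox:1.32
    command: ["nslookup", "quickstart-es-http.elastic-system.svc"]
EOF
kubectl logs -n elastic-system dns-test-hostnet
```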

thepotatocannon commented 4 years ago

Update: It was definitely the hostNetwork parameter (changing dnsPolicy didn't make any difference in behavior). The problem was that when I commented out hostNetwork: true, no data could be collected from the nodes. I ended up switching to a different cluster and deploying the whole stack without changes; the DNS failure is gone, but I don't see any logs in Kibana, nor the index created in Elasticsearch. Any ideas on how to investigate this further?

EDIT: It might be related to authorization problems:

2020-08-12T11:53:39.580Z    INFO    instance/beat.go:647    Home path: [/usr/share/filebeat] Config path: [/usr/share/filebeat] Data path: [/usr/share/filebeat/data] Logs path: [/usr/share/filebeat/logs]
2020-08-12T11:53:39.580Z    INFO    instance/beat.go:655    Beat ID: edb1315c-0503-4712-a566-4904890912a1
2020-08-12T11:53:39.781Z    INFO    [seccomp]   seccomp/seccomp.go:124  Syscall filter successfully installed
2020-08-12T11:53:39.781Z    INFO    [beat]  instance/beat.go:983    Beat info   {"system_info": {"beat": {"path": {"config": "/usr/share/filebeat", "data": "/usr/share/filebeat/data", "home": "/usr/share/filebeat", "logs": "/usr/share/filebeat/logs"}, "type": "filebeat", "uuid": "edb1315c-0503-4712-a566-4904890912a1"}}}
2020-08-12T11:53:39.781Z    INFO    [beat]  instance/beat.go:992    Build info  {"system_info": {"build": {"commit": "94f7632be5d56a7928595da79f4b829ffe123744", "libbeat": "7.8.1", "time": "2020-07-21T15:12:45.000Z", "version": "7.8.1"}}}
2020-08-12T11:53:39.781Z    INFO    [beat]  instance/beat.go:995    Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":8,"version":"go1.13.10"}}}
2020-08-12T11:53:39.786Z    INFO    [beat]  instance/beat.go:999    Host info   {"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-07-14T09:21:17Z","containerized":true,"name":"kube10-dev01-worker02","ip":["127.0.0.1/8","::1/128","10.10.1.23/24","fe80::f816:3eff:fea6:2ea1/64","172.17.0.1/16","10.244.12.192/32","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64","fe80::ecee:eeff:feee:eeee/64"],"kernel_version":"3.10.0-1127.13.1.el7.x86_64","mac":["fa:16:3e:a6:2e:a1","02:42:6e:7c:b0:7e","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":8,"patch":2003,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0,"id":"1a018e03a49f4bfc904c69b0d6c08959"}}}
2020-08-12T11:53:39.787Z    INFO    [beat]  instance/beat.go:1028   Process info    {"system_info": {"process": {"capabilities": {"inheritable":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"permitted":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"effective":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/filebeat", "exe": "/usr/share/filebeat/filebeat", "name": "filebeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-08-12T11:53:37.650Z"}}}
2020-08-12T11:53:39.787Z    INFO    instance/beat.go:310    Setup Beat: filebeat; Version: 7.8.1
2020-08-12T11:53:39.787Z    INFO    [index-management]  idxmgmt/std.go:184  Set output.elasticsearch.index to 'filebeat-7.8.1' as ILM is enabled.
2020-08-12T11:53:39.878Z    INFO    eslegclient/connection.go:99    elasticsearch url: https://test-es-http.elastic-system.svc:9200
2020-08-12T11:53:39.879Z    INFO    [publisher] pipeline/module.go:113  Beat name: kube10-dev01-worker02
2020-08-12T11:53:39.880Z    INFO    [monitoring]    log/log.go:118  Starting metrics logging every 30s
2020-08-12T11:53:39.880Z    INFO    instance/beat.go:463    filebeat start running.
2020-08-12T11:53:39.880Z    INFO    registrar/registrar.go:145  Loading registrar data from /usr/share/filebeat/data/registry/filebeat/data.json
2020-08-12T11:53:39.880Z    INFO    registrar/registrar.go:152  States Loaded from registrar: 0
2020-08-12T11:53:39.880Z    INFO    [crawler]   beater/crawler.go:71    Loading Inputs: 0
2020-08-12T11:53:39.880Z    INFO    [crawler]   beater/crawler.go:108   Loading and starting Inputs completed. Enabled inputs: 0
2020-08-12T11:53:39.881Z    WARN    [cfgwarn]   kubernetes/config.go:80 DEPRECATED: `host` will be deprecated, use `node` instead Will be removed in version: 8.0
2020-08-12T11:53:39.979Z    WARN    [cfgwarn]   kubernetes/config.go:80 DEPRECATED: `host` will be deprecated, use `node` instead Will be removed in version: 8.0
2020-08-12T11:53:39.979Z    INFO    [autodiscover.pod]  kubernetes/util.go:79   kubernetes: Using node kube10-dev01-worker02 provided in the config
2020-08-12T11:53:39.979Z    INFO    [autodiscover]  autodiscover/autodiscover.go:113    Starting autodiscover manager
2020-08-12T11:53:40.282Z    INFO    log/input.go:152    Configured paths: [/var/log/containers/*9ac77d2306eea57c4ea2b46b45d6f19801487318351792b97261dc25bfb8b37f.log]
2020-08-12T11:53:40.283Z    INFO    log/input.go:152    Configured paths: [/var/log/containers/*9ac77d2306eea57c4ea2b46b45d6f19801487318351792b97261dc25bfb8b37f.log]
2020-08-12T11:53:40.285Z    INFO    log/input.go:152    Configured paths: [/var/log/containers/*d51dbe98266811fbeb3800a2cc38aa35a0688c386865e942874bb3ccdc0102b7.log]
2020-08-12T11:53:40.579Z    INFO    log/input.go:152    Configured paths: [/var/log/containers/*d51dbe98266811fbeb3800a2cc38aa35a0688c386865e942874bb3ccdc0102b7.log]
2020-08-12T11:53:40.579Z    INFO    eslegclient/connection.go:99    elasticsearch url: https://test-es-http.elastic-system.svc:9200
2020-08-12T11:53:40.817Z    ERROR   [esclientleg]   eslegclient/connection.go:261   error connecting to Elasticsearch at https://test-es-http.elastic-system.svc:9200: 401 Unauthorized: {"error":{"root_cause":[{"type":"security_exception","reason":"unable to authenticate user [elastic-system-filebeat-beat-user] for REST request [/]","header":{"WWW-Authenticate":["Bearer realm=\"security\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}}],"type":"security_exception","reason":"unable to authenticate user [elastic-system-filebeat-beat-user] for REST request [/]","header":{"WWW-Authenticate":["Bearer realm=\"security\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}},"status":401}
2020-08-12T11:53:40.817Z    ERROR   fileset/factory.go:134  Error loading pipeline: Error creating Elasticsearch client: couldn't connect to any of the configured Elasticsearch hosts. Errors: [error connecting to Elasticsearch at https://test-es-http.elastic-system.svc:9200: 401 Unauthorized: {"error":{"root_cause":[{"type":"security_exception","reason":"unable to authenticate user [elastic-system-filebeat-beat-user] for REST request [/]","header":{"WWW-Authenticate":["Bearer realm=\"security\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}}],"type":"security_exception","reason":"unable to authenticate user [elastic-system-filebeat-beat-user] for REST request [/]","header":{"WWW-Authenticate":["Bearer realm=\"security\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}},"status":401}]
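
For reference, the 401 can be cross-checked by confirming the user secret the Beat association created and testing the endpoint with the built-in elastic user (a sketch; secret names follow the usual ECK conventions, so adjust if yours differ):

```sh
# The Beat association should have created a user secret in the Beat's namespace
kubectl get secrets -n elastic-system | grep beat-user

# Grab the built-in elastic user's password (the cluster in these logs is named "test")
PASSWORD=$(kubectl get secret test-es-elastic-user -n elastic-system \
  -o go-template='{{.data.elastic | base64decode}}')

# Check that the HTTPS endpoint accepts basic auth at all
kubectl run auth-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk -u "elastic:$PASSWORD" "https://test-es-http.elastic-system.svc:9200"
```
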
barkbay commented 4 years ago

> I ended up switching to a different cluster and deploying the whole stack without changes; the DNS failure is gone, but I don't see any logs in Kibana, nor the index created in Elasticsearch.

We are trying to keep GitHub issues only for actual bug reports and feature requests. It seems that your issue is related to your environment. We would be happy to help, but could you create a topic in our discuss forum at https://discuss.elastic.co/c/eck? Also, please provide more details about your environment (on prem, EKS, GKE, OCP?), the operator logs, the health of the cluster, and so on.

Thank you.

meiry commented 3 years ago

@thepotatocannon
I'm having the same issue when deploying in Kubernetes. Did you manage to fix it?

thepotatocannon commented 3 years ago

For the first issue, we had a problem with the DNS settings. As for authorization, for the purpose of tests we simply turned off HTTPS. Finally, regarding the missing logs in Elasticsearch/Kibana, it turned out that container logs in our environment were saved in a different directory, so mounting the right one from the workers was enough.
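
To illustrate the mount fix, a sketch of the relevant excerpt of the Beat spec (alongside the quickstart's /var/log/containers mount, which only holds symlinks, mount the directory the symlinks actually point to; /data/docker/containers is a hypothetical example path standing in for wherever your runtime writes logs):

```yaml
daemonSet:
  podTemplate:
    spec:
      containers:
      - name: filebeat
        volumeMounts:
        - name: containerlogs
          mountPath: /data/docker/containers   # same path inside the Pod as on the host,
          readOnly: true                       # so the symlink targets resolve
      volumes:
      - name: containerlogs
        hostPath:
          path: /data/docker/containers
```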

meiry commented 3 years ago

@thepotatocannon how did you turn off HTTPS? And how did you find the error about the mounting? The Kibana and Filebeat logs don't show anything related to this...

thepotatocannon commented 3 years ago

I don't remember exactly how, but I'm pretty sure it's somewhere in the documentation. As for the logs, I first checked the files in the Filebeat container: there were symbolic links to the files containing the logs, but the container itself didn't have access to the actual files, which is why mounting them was enough.
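
A quick way to reproduce that check (a sketch; the DaemonSet name below follows ECK's usual <beat-name>-beat-<type> naming, so adjust it to your setup):

```sh
# List what Filebeat sees: symlinks in /var/log/containers and their targets
kubectl exec -n elastic-system daemonset/filebeat-beat-filebeat -- \
  sh -c 'ls -l /var/log/containers | head'

# Try to read one symlink target; an error here means the directory the
# links point to is not mounted into the Pod
kubectl exec -n elastic-system daemonset/filebeat-beat-filebeat -- \
  sh -c 'head -n 1 "$(readlink /var/log/containers/*.log | head -n 1)"'
```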

meiry commented 3 years ago

@thepotatocannon OK, I did the log check and they are indeed there; I can access and see them with no problem. About HTTPS: you mean turning it off at the Elasticsearch cluster level, right? So I could access it like this: curl -u "elastic:xxxxxxxx" -k "http://my-cluster-es-http:9200"

thepotatocannon commented 3 years ago

Probably yes, that's what we did, but honestly I can't remember exactly now.
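
For completeness, disabling the self-signed HTTPS endpoint is done at the Elasticsearch cluster level via the HTTP TLS settings in the spec, per the ECK TLS documentation (a minimal sketch for test environments, using the cluster name from the curl example above):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: my-cluster
  namespace: elastic-system
spec:
  version: 7.8.1
  http:
    tls:
      selfSignedCertificate:
        disabled: true   # serve plain HTTP on 9200 instead of self-signed HTTPS
  nodeSets:
  - name: default
    count: 1
```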