elastic / eck-diagnostics

Diagnostic tooling for ECK installations

failed to capture cluster diagnostic #54

Open milanage opened 3 years ago

milanage commented 3 years ago

Tried to capture ECK diagnostics. The command succeeded and we got the tarball, but the cluster diagnostic seems to have failed (the ECK dump part was captured correctly).

In eck-diagnostic-errors.txt

Delete "https://xxxxxxx.xxx.us-west-2.eks.amazonaws.com/api/v1/namespaces/abc-namespace/pods/xxx-elasticsearch-elasticsearch-diag": net/http: TLS handshake timeout

In eck-diagnostics.log

2021/10/13 19:03:28 ECK diagnostics with parameters: {DiagnosticImage:docker.elastic.co/eck-dev/support-diagnostics:8.1.4 ECKVersion: Kubeconfig: OperatorNamespaces:[elastic-system] ResourcesNamespaces:[abc-namespace] OutputDir: RunStackDiagnostics:true Verbose:false}
2021/10/13 19:03:54 Extracting Kubernetes diagnostics from elastic-system
2021/10/13 19:04:25 ECK version is 1.6.0
2021/10/13 19:04:25 Extracting Kubernetes diagnostics from abc-namespace
2021/10/13 19:58:46 Kibana diagnostics extracted for abc-namespace/xxx-kibana-external

In the Kibana diagnostics.log

23:57:41.774 [main] INFO  com.elastic.support.BaseService - Diagnostic logger reconfigured for inclusion into archive
23:57:41.776 [main] INFO  com.elastic.support.diagnostics.commands.CheckKibanaVersion - Getting Kibana Version.
23:58:41.875 [main] ERROR com.elastic.support.rest.RestClient - Unexpected Execution Error
org.apache.http.conn.ConnectTimeoutException: Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/172.20.50.239] failed: connect timed out
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72) ~[httpclient-4.5.10.jar:4.5.10]
    at com.elastic.support.rest.RestClient.execRequest(RestClient.java:73) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.rest.RestClient.execGet(RestClient.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.rest.RestClient.execQuery(RestClient.java:58) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.commands.CheckKibanaVersion.getKibanaVersion(CheckKibanaVersion.java:95) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.commands.CheckKibanaVersion.execute(CheckKibanaVersion.java:64) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.chain.DiagnosticChainExec.runDiagnostic(DiagnosticChainExec.java:111) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.DiagnosticService.exec(DiagnosticService.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.DiagnosticApp.main(DiagnosticApp.java:42) [support-diagnostics-8.1.4.jar:8.1.4]
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403) ~[?:?]
    at java.net.Socket.connect(Socket.java:591) ~[?:?]
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368) ~[httpclient-4.5.10.jar:4.5.10]
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[httpclient-4.5.10.jar:4.5.10]
    ... 16 more
23:58:41.882 [main] ERROR com.elastic.support.diagnostics.commands.CheckKibanaVersion - Unanticipated error:
java.lang.RuntimeException: Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/xxx.xx.xx.xxx] failed: connect timed out
    at com.elastic.support.rest.RestClient.execRequest(RestClient.java:79) ~[support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.rest.RestClient.execGet(RestClient.java:68) ~[support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.rest.RestClient.execQuery(RestClient.java:58) ~[support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.commands.CheckKibanaVersion.getKibanaVersion(CheckKibanaVersion.java:95) ~[support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.commands.CheckKibanaVersion.execute(CheckKibanaVersion.java:64) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.chain.DiagnosticChainExec.runDiagnostic(DiagnosticChainExec.java:111) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.DiagnosticService.exec(DiagnosticService.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
    at com.elastic.support.diagnostics.DiagnosticApp.main(DiagnosticApp.java:42) [support-diagnostics-8.1.4.jar:8.1.4]
23:58:41.882 [main] ERROR com.elastic.support.diagnostics.DiagnosticService - Could't retrieve Kibana version due to a system or network error. Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/xxx.xx.xx.xxx] failed: connect timed out
Check diagnostics.log in the archive file for more detail.
23:58:41.883 [main] INFO  com.elastic.support.BaseService - Closing loggers.
23:58:41.883 [main] INFO  com.elastic.support.BaseService - Archiving diagnostic results.

Are there any other flags that we need to specify apart from -o and -r?

kunisen commented 3 years ago

The timeout appears to be 1 minute when collecting the Kibana diagnostics.

23:57:41.776 [main] INFO  com.elastic.support.diagnostics.commands.CheckKibanaVersion - Getting Kibana Version.
23:58:41.875 [main] ERROR com.elastic.support.rest.RestClient - Unexpected Execution Error
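The one-minute figure can be checked directly from the two timestamps above. A minimal sketch in Python, assuming both lines come from the same run; the helper name `elapsed_seconds` is made up for illustration and is not part of either tool:

```python
from datetime import datetime, timedelta

def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS.mmm log timestamps.

    The log lines carry no date, so a negative difference is assumed
    to mean the run crossed midnight once.
    """
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta.total_seconds()

# "Getting Kibana Version." -> "Unexpected Execution Error"
print(f"{elapsed_seconds('23:57:41.776', '23:58:41.875'):.3f} s")  # prints 60.099 s
```

A gap of almost exactly 60 seconds is consistent with a fixed connect timeout expiring rather than with a slow response.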

I'm not yet sure whether this is purely a timeout issue, but the timeout is hard to tweak right now because it isn't exposed as a parameter. Could we first expose this option and see whether simply adjusting the timeout value solves the issue?

Alternatively, we could use 5 minutes by default. Depending on the situation, making the timeout tunable and also lengthening the default a bit may be the better option.

pebrc commented 3 years ago

These timeouts are defaults in the stack diagnostics tool, not in eck-diagnostics: https://github.com/elastic/support-diagnostics/blob/bad8fe76f2d2be716c14ffc5455f8fb51d78d280/src/main/resources/diags.yml#L24-L30

They are read from the class path, so I think we would have to either rebuild the support-diagnostics tool with different settings or inject a different configuration file into the JVM class path.

The other question is: do we have any reason to expect that the Kibana diagnostics extraction would have succeeded if we had waited longer?
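One way to approach that question, without rebuilding anything, would be to probe the Kibana service port with a much longer connect timeout and see whether the TCP connection is ever established. The following is only a sketch in Python: `probe_connect` is a hypothetical helper, the host and port are taken from the log above, and it assumes the script runs from a pod in the same namespace so that the service name resolves.

```python
import socket
import time

def probe_connect(host: str, port: int, timeout_s: float):
    """Try a plain TCP connect and report (ok, elapsed_seconds, error)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and applies the timeout
        # to the connect attempt.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, None
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, time.monotonic() - start, str(exc)

# Example call (assumed values, 5 minutes instead of the observed 1 minute):
# ok, elapsed, err = probe_connect("xxx-kibana-external-kb-http", 5601, 300)
# print(ok, f"{elapsed:.1f}s", err)
```

If the probe still times out after several minutes, a longer timeout in the diagnostics tool is unlikely to help, and the connectivity between the diagnostics pod and the Kibana service would be the more likely place to look.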

milanage commented 3 years ago

I'm not sure about the Kibana diagnostics part, but we attempted an ES diagnostic (same API mode) and it was successful. The uncompressed ES diagnostics are quite large (~660MB, with a 108MB cluster_state.json). I guess the failure could be related to the large size? On the other hand, if the standalone diag tool and the one in eck-diagnostics do exactly the same thing, why was the outcome different?