Closed: igoristic closed this issue 2 years ago
Pinging @elastic/stack-monitoring (Team:Monitoring)
Looking at the code we do indeed use `ignoreUnavailable: true`, which translates to `skip_unavailable: true` in ES, so I'm assuming this to be a bug rather than an enhancement.
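For reference, a sketch of the two knobs in play here (the remote cluster alias `es1` is only an example name, not taken from this setup): the per-request flag Kibana sends on `_search`, and the per-remote-cluster setting ES uses to skip an unreachable remote during CCS.

```
# Per-request: don't fail the search if a named index/pattern is unavailable
GET .monitoring-es-*/_search?ignore_unavailable=true

# Per remote cluster: skip the remote entirely if it can't be reached during CCS
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.es1.skip_unavailable": true
  }
}
```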
Do we have any hypotheses about the reason for the bug? What are the AC for this ticket specifically? Thanks!
@jasonrhodes I'm still looking into this issue, but the underlying cause is not very apparent. From testing I either get a 30s timeout or occasionally a 500.

My initial guess is that we are probably missing an `ignoreUnavailable` call somewhere (perhaps in places where we're now using the new ES client). Based on the related issues this might even be a regression.
@igoristic can you update the ticket description to include some kind of AC for what the scope of this ticket is? I'm concerned about just adding a config option to turn this off until we fully understand why the requests are failing despite the ignore unavailable/skip_unavailable settings being in place. We should probably dig into that as much as possible before we decide on a fix for this. Feel free to pull others in to help think this through.
Based on what I see in #55157 we may want to address this by handling the failure gracefully and showing the cluster in an "unavailable" state (that's what I would expect from a monitoring tool)
I definitely wouldn't expect the UI to tell me that there is "No monitoring data found" and ask me to set up monitoring.
I was testing this out using the docker-compose setup @jguay provided, and this may go beyond stack monitoring.
Just a basic CCS+local query also results in a 502 from Kibana Dev Tools.
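For illustration, the kind of Dev Tools query meant here (same index pattern as the curl reproduction below):

```
GET *:.monitoring-es-*,.monitoring-es-*/_search
```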
I did a basic test directly against ES and it returned in ~12 minutes, which I'm guessing is far longer than any Kibana timeout in the chain:
```
~/Downloads/mon_stuck_when_CCS_conn_stuck_7121 16:52:03
❯ curl -k -u elastic:changeme 'https://localhost:9200/*:.monitoring-es-*,.monitoring-es-*/_search'
{"took":690013,"timed_out":false,"num_reduce_phases":2,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"_clusters":{"total":2,"successful":1,"skipped":1},"hits":{"total":{"value":6057,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".monitoring-es-7-mb-
[snip]
~/Downloads/mon_stuck_when_CCS_conn_stuck_7121 11m 29s
```
Not seeing anything in https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-remote-clusters.html, but I'm guessing we'll need some way to tell ES to time out sooner than Kibana does in order to handle this gracefully.
So this morning, with everything still running, the UI works now.
The iptables REJECT is still in place:
```
❯ docker-compose exec --privileged -u root es1 iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
REJECT     tcp  --  anywhere             anywhere             tcp dpt:vrace reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

# Warning: iptables-legacy tables present, use iptables-legacy to see them
```
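(`dpt:vrace` is port 9300, i.e. the ES transport port. For anyone reproducing this, the rule above was presumably added with something along these lines; a sketch only, matching the output shown:)

```sh
# Reject inbound ES transport traffic on es1 to simulate an unreachable remote cluster
docker-compose exec --privileged -u root es1 iptables -A INPUT -p tcp --dport 9300 -j REJECT
```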
So I'd say what we're looking at here isn't so much that monitoring doesn't handle the unavailable requests, but rather that the ES internal timeout for CCS appears to be far longer than Kibana's.
Looks like https://github.com/elastic/elasticsearch/issues/34405 has some connection here.
Also, due to https://github.com/elastic/elasticsearch/issues/32678, it seems unlikely that Kibana could set a query timeout to avoid just exploding.
In the meantime I've confirmed that the workaround of `monitoring.ui.ccs.enabled: false` does work as expected.
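For anyone landing here, that goes in `kibana.yml`; note it simply disables CCS in the Stack Monitoring UI, so remote clusters' monitoring data won't be shown:

```yaml
# kibana.yml: stop Stack Monitoring from prefixing its queries with *: for remote clusters
monitoring.ui.ccs.enabled: false
```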
I found another workaround from @DaveCTurner in https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/7
```yaml
# docker-compose.yaml
es0:
  ...
  sysctls:
    - net.ipv4.tcp_retries2=6
```
Where es0 is the cluster we're hitting with kibana0, attempting CCS to the blocked cluster.
So far I can't seem to find out how to override the 30s timeout on `/api/monitoring/v1/clusters`, which would be good to have handy I think.
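My assumption (not verified for this endpoint) is that the 30s comes from Kibana's default ES client request timeout, which can be raised globally in `kibana.yml`:

```yaml
# kibana.yml (assumption: the default 30000 ms requestTimeout is what trips here)
elasticsearch.requestTimeout: 60000
```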
At any rate, I think once I find those I'd like to close this issue in favor of the ES issue https://github.com/elastic/elasticsearch/issues/34405. If we can get ES handling the CCS partition more gracefully for any requests, all kibana apps should benefit.
FWIW today we officially recommend `net.ipv4.tcp_retries2=5`, which gives a timeout of ~6s. Setting it to 6 roughly doubles that to ~12s, which we feel is too long.
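On a plain host (outside the docker-compose setup) the recommended value can be applied with sysctl; a sketch, persist it via /etc/sysctl.conf if you want it to survive reboots:

```sh
# Reduce TCP retransmission retries so dead CCS connections fail in ~6s instead of many minutes
sysctl -w net.ipv4.tcp_retries2=5
```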
Oh, thanks @DaveCTurner! I'd still like to figure out how we can tune the Stack Monitoring Elasticsearch client timeout in Kibana (for example to 60s).
So long as that setting (once I find it) and `monitoring.ui.ccs.enabled: false` are easily available as workarounds, I think we can call this issue done and focus on making ES more responsive when dealing with CCS networking difficulties.
I think we can close this now that https://github.com/elastic/elasticsearch/issues/74773 is also closed.
The queries used when accessing the SM API are prepended with `*:` (on all `.monitoring-*` indices) when calling ES's `_search` API. This is to also access remote clusters' data, and is on by default. As a result, when one of the remote clusters is not responding the UI stops working.

Possible workaround: set `monitoring.ui.ccs.enabled: false` in `kibana.yml` to disable CCS in the Stack Monitoring UI.

Looking at the code we do indeed use `ignoreUnavailable: true`, which translates to `skip_unavailable: true` in ES, so I'm assuming this to be a bug rather than an enhancement, since the API call should still pass gracefully and the UI should still work without any interruptions/blockers.

Related: https://github.com/elastic/kibana/issues/57596
Related: https://github.com/elastic/kibana/issues/36323
Related: https://github.com/elastic/kibana/issues/82143