elastic / kibana


[Monitoring] Handle failed/unavailable CCS requests #100696

Closed: igoristic closed this issue 2 years ago

igoristic commented 3 years ago

The queries used when accessing the Stack Monitoring API are prepended with *: (on all .monitoring-* indices) when calling ES's _search API. This is done so remote clusters' data is also accessible, and it is on by default. As a result, when one of the remote clusters is not responding, the UI stops working.
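Concretely, the resulting index pattern looks like the one in the curl reproduction further down this thread (the ES indices are shown here; the other .monitoring-* patterns get the same prefix):

*:.monitoring-es-*,.monitoring-es-*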

Possible workaround:

# 6.1 - 7.6
xpack.monitoring.ccs.enabled: false

# 7.7+
monitoring.ui.ccs.enabled: false

Looking at the code we do indeed use ignoreUnavailable: true which translates to skip_unavailable: true in ES, so I'm assuming this to be a bug rather than an enhancement, since the API call should still pass gracefully and the UI should still work without any interruptions/blockers.
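For reference, a rough sketch of both knobs at the ES level (host, credentials, and the remote alias remote_cluster_1 are placeholders): ignore_unavailable is a per-request _search option that skips missing or closed indices, while skip_unavailable is configured per remote cluster and lets CCS skip a remote that cannot be reached.

# Per-request _search option: skip missing or closed indices
curl -k -u elastic:changeme \
  'https://localhost:9200/*:.monitoring-es-*,.monitoring-es-*/_search?ignore_unavailable=true'

# Per-remote-cluster setting: let CCS skip this remote when it is unreachable
# ("remote_cluster_1" is a placeholder alias)
curl -k -u elastic:changeme -X PUT -H 'Content-Type: application/json' \
  'https://localhost:9200/_cluster/settings' -d '
{
  "persistent": {
    "cluster.remote.remote_cluster_1.skip_unavailable": true
  }
}'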

Related: https://github.com/elastic/kibana/issues/57596
Related: https://github.com/elastic/kibana/issues/36323
Related: https://github.com/elastic/kibana/issues/82143

elasticmachine commented 3 years ago

Pinging @elastic/stack-monitoring (Team:Monitoring)

jasonrhodes commented 3 years ago

Looking at the code we do indeed use ignoreUnavailable: true which translates to skip_unavailable: true in ES, so I'm assuming this to be a bug rather than an enhancement.

Do we have any hypotheses about the reason for the bug? What are the AC for this ticket specifically? Thanks!

igoristic commented 3 years ago

@jasonrhodes I'm still looking into this issue, but the underlying cause is not very apparent. From testing I either get a 30s timeout or occasionally a 500.

My initial guess is that we are probably missing an ignoreUnavailable call somewhere (perhaps in places where we are now using the new ES client). Based on the related issues this might even be a regression.

jasonrhodes commented 3 years ago

@igoristic can you update the ticket description to include some kind of AC for what the scope of this ticket is? I'm concerned about just adding a config option to turn this off until we fully understand why the requests are failing despite the ignore unavailable/skip_unavailable settings being in place. We should probably dig into that as much as possible before we decide on a fix for this. Feel free to pull others in to help think this through.

jasonrhodes commented 3 years ago

Based on what I see in #55157, we may want to address this by handling the failure gracefully and showing the cluster in an "unavailable" state (that's what I would expect from a monitoring tool).

I definitely wouldn't expect the UI to tell me that there is "No monitoring data found" and ask me to set up monitoring.

matschaffer commented 3 years ago

I was testing this out using the docker-compose setup @jguay provided, and this may go beyond stack monitoring.

Just a basic CCS+local query run from Kibana Dev Tools also results in a 502.

[screenshot: Screenshot_2021_06_21_16_50]
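For reproducing this outside the browser, a hedged sketch of the same request sent through Kibana rather than straight to ES, using /api/console/proxy (the internal endpoint Dev Tools calls); the Kibana host, protocol, and credentials are assumptions based on the compose setup:

# Same query, but routed through Kibana's Dev Tools proxy so Kibana-side timeouts apply
curl -u elastic:changeme -X POST -H 'kbn-xsrf: true' \
  'http://localhost:5601/api/console/proxy?path=*:.monitoring-es-*,.monitoring-es-*/_search&method=GET'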

matschaffer commented 3 years ago

I did a basic test directly against ES and it returned in ~12 minutes, which I'm guessing is far longer than any kibana timeout in the chain:

❯ curl -k -u elastic:changeme 'https://localhost:9200/*:.monitoring-es-*,.monitoring-es-*/_search'
{"took":690013,"timed_out":false,"num_reduce_phases":2,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"_clusters":{"total":2,"successful":1,"skipped":1},"hits":{"total":{"value":6057,"relation":"eq"},"max_score":1.0,"hits":[{"_index":".monitoring-es-7-mb-
[snip]
(the shell reported the command took 11m 29s)

Not seeing anything on https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-remote-clusters.html, but I'm guessing we'll need some way to tell ES to time out sooner than Kibana does if we want to handle this gracefully.

matschaffer commented 3 years ago

So this morning, with everything still running, the UI works now.

[screenshot: Screenshot_2021_06_22_10_42]

The iptables REJECT is still in place:

❯ docker-compose exec --privileged -u root es1 iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
REJECT     tcp  --  anywhere             anywhere             tcp dpt:vrace reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
# Warning: iptables-legacy tables present, use iptables-legacy to see them

So I'd say what we're looking at here isn't so much that monitoring doesn't handle the unavailable requests, but rather that the ES internal timeout for CCS appears to be far longer than Kibana's.

matschaffer commented 3 years ago

Looks like https://github.com/elastic/elasticsearch/issues/34405 has some connection here.

Also, due to https://github.com/elastic/elasticsearch/issues/32678, it seems unlikely that Kibana could set a query timeout to avoid just exploding.

matschaffer commented 3 years ago

In the meantime I've confirmed that the workaround of monitoring.ui.ccs.enabled: false does work as expected.

matschaffer commented 3 years ago

I found another workaround from @DaveCTurner in https://discuss.elastic.co/t/elasticsearch-ccs-client-get-timeout-when-remote-cluster-is-isolated-by-firewall/152019/7

# docker-compose.yaml
  es0:
    ...
    sysctls:
      - net.ipv4.tcp_retries2=6

Here es0 is the cluster we're hitting with kibana0, attempting CCS to the blocked cluster.
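Presumably the same knob can also be applied directly on a non-containerized host (a sketch; the value 6 just mirrors the compose file above, and the file name under /etc/sysctl.d is arbitrary):

# Apply at runtime on the host running the local ES node
sudo sysctl -w net.ipv4.tcp_retries2=6

# Persist across reboots (arbitrary file name)
echo 'net.ipv4.tcp_retries2=6' | sudo tee /etc/sysctl.d/90-tcp-retries2.conf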

So far I can't find a way to override the 30s timeout on /api/monitoring/v1/clusters, which would be good to have handy I think.

At any rate, I think once I find those I'd like to close this issue in favor of the ES issue https://github.com/elastic/elasticsearch/issues/34405. If we can get ES to handle the CCS partition more gracefully for any request, all Kibana apps should benefit.

DaveCTurner commented 3 years ago

FWIW, today we officially recommend net.ipv4.tcp_retries2=5, which gives a timeout of ~6s. Setting it to 6 roughly doubles that to ~12s, which we feel is too long.
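As a rough back-of-the-envelope for those numbers (my assumption: Linux's minimum retransmission timeout of about 200ms and plain exponential backoff, which holds for small retry counts):

# total wait before the connection is dropped ≈ 0.2s * (2^tcp_retries2 - 1)
#   tcp_retries2=5  ->  0.2 * 31 ≈ 6s
#   tcp_retries2=6  ->  0.2 * 63 ≈ 12.6s  (roughly double)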

matschaffer commented 3 years ago

Oh, thanks @DaveCTurner! I'd still like to figure out how we can tune the timeout for the stack monitoring Elasticsearch client in Kibana (for example to 60s).

So long as that setting (once I find it) and monitoring.ui.ccs.enabled: false are easily available as workarounds, I think we can call this issue done and focus on making ES more responsive in the face of CCS networking difficulties.

matschaffer commented 2 years ago

I think we can close this now that https://github.com/elastic/elasticsearch/issues/74773 is also closed.