barkbay opened 6 months ago
@cmacknz could you help us assess what the problem might be? It is an older version of the stack, 8.1.3, that we are testing here, which might have known issues. If that is the case, we would also be glad for some guidance on which versions to mute our tests for.
From the logs posted here, this is what I see:
connection refused
errors.Error: fleet-server failed: context canceled
is not a real error; it is us failing to filter it out, because the internal contexts are expected to be cancelled during shutdown. I suspect that Fleet Server restarting is the root cause here, but I can't tell from the logs why it is shutting down.
Fleet Server doesn't actually depend on Kibana at all; it depends on Elasticsearch, and information between Fleet Server and Kibana is exchanged via the .fleet indices.
Was Elasticsearch working as expected during this test? Are there any more Fleet Server logs besides those posted here?
Was Elasticsearch working as expected during this test?
Cluster was green with the following indices:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size sth
green open .apm-agent-configuration Phgbu1G_QuKRypuhfrlu5g 1 1 0 0 450b 225b false
green open .apm-custom-link IdCZhr3qRdOVN8NhDucCYg 1 1 0 0 450b 225b false
green open .ds-ilm-history-5-2024.04.29-000001 mSSaUyQMQcmlaa8G_PU8dA 1 1 6 0 37.2kb 18.6kb false
green open .fleet-agents-7 0Qw2CLFgS6Ols6CBQ2hXRQ 1 1 3 1 429kb 208.3kb false
green open .fleet-enrollment-api-keys-7 kCxvPPR6T7y9co0Tv6NNUg 1 1 2 0 13.2kb 6.6kb false
green open .geoip_databases _uqV82zNRZCZbpkQFPu8sw 1 1 33 0 62.3mb 31.1mb false
green open .kibana-event-log-8.1.3-000001 SCl-CCDPSxCTKKFvNyhxVQ 1 1 2 0 24.6kb 12.3kb false
green open .kibana_8.1.3_001 mbYdeM1-RKCDNPXsN_iybw 1 1 1739 998 19.1mb 9.5mb false
green open .kibana_security_session_1 Kl6U1-TzRHqeEPcdw7nInA 1 1 0 0 450b 225b false
green open .kibana_task_manager_8.1.3_001 7HnEovZaRt-MI7Ds-qcsFg 1 1 20 984 430.1kb 180.8kb false
green open .security-7 bkRgaX0eSN2LjRcumxkiHA 1 1 110 0 721.1kb 364.4kb false
green open .tasks njUPJ0uPSOy1htV2deDGzg 1 1 2 0 22.3kb 14.2kb false
I can't find any error in the logs.
Are there any more Fleet Server logs besides those posted here?
Sorry for the screenshot; I have some issues when I try to get an export:
Please note that Kibana has used my TZ (GMT+2) in the screenshot above
@michel-laterman does anything jump out in these 8.1.3 agent logs?
The "Fleet Server - Waiting in policy..." message indicates that the policy self-monitor is awaiting for the policy to be detected, but .fleet-policies
isn't a part of the indices listed in https://github.com/elastic/cloud-on-k8s/issues/7790
.fleet-policies
is written by the Fleet plugin in Kibana correct?
So the problem may be Kibana after all.
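One way to double-check on a live cluster would be to list the .fleet-* system indices directly and see whether .fleet-policies is there at all. A minimal sketch, assuming port-forwarded access to Elasticsearch; the endpoint and credentials are placeholders for whatever the test environment uses:

```shell
# List the .fleet-* indices, including hidden ones, to see whether .fleet-policies exists.
# Endpoint and credentials below are placeholders, not the actual e2e test values.
curl -sk -u "elastic:$ELASTIC_PASSWORD" \
  "https://localhost:9200/_cat/indices/.fleet-*?v&expand_wildcards=all"
```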
Yes, I think it's written by Kibana, but I'm not sure why there are no error logs from fleet-server to indicate the reason for the shutdown.
I'll try to reproduce it in my dev env, with log level increased.
FWIW the last successful run was on Saturday 2024-04-27T23:06:08.133397756Z, and the first failure is reported on Monday 2024-04-28. There is no commit in ECK during that time frame 😕
I'll try to reproduce it in my dev env, with log level increased.
😞
I reproduced the problem in our CI and attempted to run the Agent diagnostic tool using https://github.com/elastic/eck-diagnostics/, but for some reason there is no Agent diagnostic data in the archive 😞
I also attempted to enable debug logs, without success; it does not seem to give us more logs. I'm a bit out of ideas...
We seem to hit a similar failure on 8.15.0-SNAPSHOT as well https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/538#018f5503-4478-47bc-a9f5-ad5f1178dbf3
I think I was able to repro this on my dev cluster. I saw no container kills or Pod restarts, but on the node where the agent Pods were running, I can see the agent process being continuously OOM killed:
[Wed May 8 22:10:22 2024] agentbeat invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
[Wed May 8 22:10:22 2024] CPU: 7 PID: 64018 Comm: agentbeat Not tainted 6.1.75+ #1
[Wed May 8 22:10:22 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
[Wed May 8 22:10:22 2024] Call Trace:
[Wed May 8 22:10:22 2024] <TASK>
[Wed May 8 22:10:22 2024] dump_stack_lvl+0x4a/0x70
[Wed May 8 22:10:22 2024] dump_header+0x52/0x250
[Wed May 8 22:10:22 2024] oom_kill_process+0x10a/0x220
[Wed May 8 22:10:22 2024] out_of_memory+0x3dc/0x5c0
[Wed May 8 22:10:22 2024] ? __wake_up_locked_key+0x52/0x80
[Wed May 8 22:10:22 2024] try_charge_memcg+0x827/0xa90
[Wed May 8 22:10:22 2024] charge_memcg+0x3f/0x1f0
[Wed May 8 22:10:22 2024] __mem_cgroup_charge+0x2b/0x80
[Wed May 8 22:10:22 2024] wp_page_copy+0x3a8/0xa40
[Wed May 8 22:10:22 2024] handle_mm_fault+0x84a/0x16b0
[Wed May 8 22:10:22 2024] do_user_addr_fault+0x271/0x4d0
[Wed May 8 22:10:22 2024] exc_page_fault+0x78/0xf0
[Wed May 8 22:10:22 2024] asm_exc_page_fault+0x22/0x30
[Wed May 8 22:10:22 2024] RIP: 0033:0x55c6379eea4e
[Wed May 8 22:10:22 2024] Code: 24 38 48 8b 7c 24 78 48 8d 05 ae 23 7b 09 e8 c9 22 90 ff 48 89 44 24 58 84 00 48 8b 94 24 98 00 00 00 48 8b b4 24 90 00 00 00 <48> 89 30 83 3d f8 6e 79 0d 00 74 02 eb 04 eb 14 66 90 e8 5b 69 96
[Wed May 8 22:10:22 2024] RSP: 002b:000000c0015bc330 EFLAGS: 00010246
[Wed May 8 22:10:22 2024] RAX: 000000c00363d048 RBX: 0000000000000088 RCX: 000000c00362dec0
[Wed May 8 22:10:22 2024] RDX: 000055c644855cc8 RSI: 000055c6402ca1a0 RDI: 0000000000000000
[Wed May 8 22:10:22 2024] RBP: 000000c0015bc430 R08: 0000000000000110 R09: ffffffffffffffff
[Wed May 8 22:10:22 2024] R10: 0000000000000000 R11: 000000000001b1e7 R12: ffffffffffffffff
[Wed May 8 22:10:22 2024] R13: 000055c6408e9de0 R14: 000000c000f024e0 R15: 0000000000000000
[Wed May 8 22:10:22 2024] </TASK>
[Wed May 8 22:10:22 2024] memory: usage 358400kB, limit 358400kB, failcnt 52936
[Wed May 8 22:10:22 2024] memory+swap: usage 358400kB, limit 9007199254740988kB, failcnt 0
[Wed May 8 22:10:22 2024] kmem: usage 7228kB, limit 9007199254740988kB, failcnt 0
[Wed May 8 22:10:22 2024] Memory cgroup stats for /kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c:
[Wed May 8 22:10:22 2024] anon 359591936
file 4096
kernel 7401472
kernel_stack 1343488
pagetables 4644864
sec_pagetables 0
percpu 35920
sock 0
vmalloc 65536
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 359559168
active_anon 32768
inactive_file 0
active_file 4096
unevictable 0
slab_reclaimable 223696
slab_unreclaimable 828072
slab 1051768
workingset_refault_anon 0
workingset_refault_file 32589
workingset_activate_anon 0
workingset_activate_file 258
workingset_restore_anon 0
workingset_restore_file 3
workingset_nodereclaim 0
pgscan 123548
pgsteal 34679
pgscan_kswapd 0
pgscan_direct 123548
pgsteal_kswapd 0
pgsteal_direct 34679
pgfault 5820413
pgmajfault 0
pgrefill 6278
pgactivate 6538
pgdeactivate 6278
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
[Wed May 8 22:10:22 2024] Tasks state (memory values in pages):
[Wed May 8 22:10:22 2024] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed May 8 22:10:22 2024] [ 48825] 0 48825 257 1 28672 0 -998 pause
[Wed May 8 22:10:22 2024] [ 48859] 0 48859 625 147 49152 0 -997 tini
[Wed May 8 22:10:22 2024] [ 48871] 0 48871 640830 18324 450560 0 -997 elastic-agent
[Wed May 8 22:10:22 2024] [ 49408] 0 49408 602607 44255 847872 0 -997 agentbeat
[Wed May 8 22:10:22 2024] [ 60311] 0 60311 584174 44946 847872 0 -997 agentbeat
[Wed May 8 22:10:22 2024] [ 63295] 0 63295 584238 46602 839680 0 -997 agentbeat
[Wed May 8 22:10:22 2024] [ 63384] 0 63384 584254 47449 843776 0 -997 agentbeat
[Wed May 8 22:10:22 2024] [ 64018] 0 64018 602687 47556 856064 0 -997 agentbeat
[Wed May 8 22:10:22 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=803821597f2bfbedf470b149da5dcac07dda15484666b779993163c3edeafcc7,mems_allowed=0,oom_memcg=/kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c,task_memcg=/kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c/803821597f2bfbedf470b149da5dcac07dda15484666b779993163c3edeafcc7,task=agentbeat,pid=64018,uid=0
[Wed May 8 22:10:22 2024] Memory cgroup out of memory: Killed process 64018 (agentbeat) total-vm:2410748kB, anon-rss:69512kB, file-rss:120712kB, shmem-rss:0kB, UID:0 pgtables:836kB oom_score_adj:-997
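The memory: usage 358400kB, limit 358400kB line means the Pod's memory cgroup limit was hit: 358400 KiB is exactly 350Mi (350 * 1024 KiB). A quick way to cross-check that against the Pod spec, as a sketch; the Pod name and namespace below are taken from the later kubectl top output and may differ per run:

```shell
# Print the memory limits declared on the Agent Pod's containers. A 350Mi limit would
# match the 358400kB cgroup limit reported by the OOM killer.
kubectl get pod test-agent-system-int-sf6b-agent-6n7vm -n e2e-mercury \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
```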
It seems like the jump in memory usage actually occurs in 8.14.0 and not 8.13.0, summarizing some measurements from Slack:
8.13.0:
kubectl top pod test-agent-system-int-sf6b-agent-6n7vm -n e2e-mercury
NAME CPU(cores) MEMORY(bytes)
test-agent-system-int-sf6b-agent-6n7vm 83m 265Mi
8.14.0+
kubectl top pod test-agent-system-int-vlpc-agent-28vhc -n e2e-mercury
NAME CPU(cores) MEMORY(bytes)
test-agent-system-int-vlpc-agent-28vhc 171m 349Mi
The heap profiles in the agent diagnostics in the ECK build failures aren't particularly helpful; all 3 look like:
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@cmacknz the original issue here was related to Fleet Server restarting, while the issue I mentioned is related to the Elastic Agent. Could these be two separate issues? Just want to make sure.
Quite possibly, yes. Fleet Server could also be OOMing as the root cause there, but all we have confirmed so far is that agent in 8.14+ seems to use more baseline memory (or more memory at startup) than it did before.
I can fork off the second problem into a second issue if that makes things clearer.
Might be better until we know for sure that the two issues are related.
Looking at the issue description, at least the symptom 2024-04-29T23:18:21.222389355Z - Failure with {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default] is 100% explained by the OOMs we are observing. The process responsible for writing to that index is the one being killed.
I created https://github.com/elastic/elastic-agent/issues/4730 specifically about the agent memory usage increase, I am going to send this one back to the ECK repository to track the test failures.
As discussed on Slack, the best initial mitigation would be to increase the memory limit and confirm it fixes the tests. We have a similar issue we've detected in MKI which is probably the same root cause. https://github.com/elastic/elastic-agent/issues/4729
We have new failures; they have been pretty consistent over the last few days. The error message is slightly different as it is referring to logs-elastic_agent.filebeat-default:
404 Not Found: {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent.filebeat-default] Type:index_not_found_exception StackTrace: RootCause:[{Reason:no such index [logs-elastic_agent.filebeat-default] Type:index_not_found_exception}]}}
It is affecting:
TestAgentVersionUpgradeToLatest8x/ES_data_should_pass_validations on 8.15.0-SNAPSHOT
TestAgentVersionUpgradeToLatest8x/ES_data_should_pass_validations on 8.16.0-SNAPSHOT
TestFleetKubernetesIntegrationRecipe/ES_data_should_pass_validations on 8.14.0
I had a quick look at one diag bundle and can't find anything obvious 😕, no OOMKilled Pods for example... The closest data streams which have been created are:
.ds-logs-elastic_agent-default-2024.08.01-000001
.ds-logs-elastic_agent.fleet_server-default-2024.08.01-000001
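For reference, the logs-elastic_agent* data streams that actually exist on the cluster can be listed directly; a minimal sketch, again assuming port-forwarded access with placeholder credentials:

```shell
# List the data streams backing the agent's own logs; the one missing in the failing
# tests would be logs-elastic_agent.filebeat-default. Endpoint/credentials are placeholders.
curl -sk -u "elastic:$ELASTIC_PASSWORD" \
  "https://localhost:9200/_data_stream/logs-elastic_agent*" | jq -r '.data_streams[].name'
```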
Seeing this frequently on our nightly pipelines: https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/635#01914a0d-6b49-4854-b2c3-7efcef377733
r:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent.filebeat-default] Type:index_not_found_exception StackTrace: RootCause:[{Reason:no such index [logs-elastic_agent.filebeat-default] Type:index_not_found_exception}]}}
Test: TestFleetMode/Fleet_in_same_namespace_as_Agent/ES_data_should_pass_validations
@pierrehilbert @rdner @belimawr does this error ring a bell on your end?
@pierrehilbert @rdner @belimawr does this error ring a bell on your end?
It does not ring a bell for me :/.
Nothing on my end either. @rdner any idea?
I don't recall us changing anything related to index creation lately. Also, the fact it started affecting multiple versions at once suggests it's either a backported change or something else changed in the environment.
It looks like OOM is at least responsible for some of the failures. In https://github.com/elastic/cloud-on-k8s/pull/8021 I bumped the memory to 640Mi now, which at least in the builds I ran improved the situation. But this is probably worth investigating. Unfortunately we don't have more accurate memory metrics from the test runs due to a problem with our monitoring.
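As a sketch of what that kind of mitigation looks like on an ECK-managed Agent (the resource name, namespace, and DaemonSet/container layout below are illustrative assumptions, not the actual e2e fixtures; the real change lives in the PR above):

```shell
# Raise the memory request/limit on the Agent container through the Agent CRD's podTemplate.
# "elastic-agent", the "default" namespace, and the daemonSet layout are assumptions for this sketch.
cat > agent-memory-patch.yaml <<'EOF'
spec:
  daemonSet:
    podTemplate:
      spec:
        containers:
        - name: agent
          resources:
            requests:
              memory: 640Mi
            limits:
              memory: 640Mi
EOF
kubectl patch agent elastic-agent -n default --type merge --patch-file agent-memory-patch.yaml
```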
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/530:
I had a closer look at TestFleetMode/Fleet_in_same_namespace_as_Agent:
2024-04-29T23:03:21.220427957Z - TestFleetMode/Fleet_in_same_namespace_as_Agent/ES_data_should_pass_validations starts
2024-04-29T23:18:21.222389355Z - Failure with {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default]
Agents
The 3 Agents have the exact same final log lines:
Fleet Server
Fleet Server seems to restart quite often; the last start is 13 minutes after the Pod was created:
Looking at the logs of a previous container instance:
I think the question is what is the root cause of this fleet-server failed: context canceled?
Kibana
On the Kibana side the container has been OOMKilled once, but it seems to be ready well before Agent, as expected:
In the Kibana logs we have a few 401s from 23:02:39.664+00:00 to 23:02:45.162+00:00, which might be related to the operator trying to call https://test-agent-fleet-qj5h-kb-http.e2e-fo9kn-mercury.svc:5601/api/fleet/setup (the last 401 in the operator is at 23:02:45.193Z). Seems to be OK otherwise.