Open def- opened 1 month ago
default/environmentd-0[environmentd]: /usr/local/bin/entrypoint.sh: line 19: 14 Killed environmentd --sql-listen-addr=0.0.0.0:6875 --http-listen-addr=0.0.0.0:6876 "$@"
"Killed" is indicative of a SIGKILL, which generally only occurs on OOM. Is it possible we just ran out of memory during this test?
It doesn't seem like a test that would consume a lot of memory
The test itself wouldn't, but the infrastructure might—it's Kubernetes, Cockroach, MinIO, a bunch of controllers... that might add up.
In that case we would also see this crash in other places so this doesn't fully explain why we see it only in test_missing_secret after deleting the k8s secret.
Yeah, I agree that's very suspicious. But I've also never seen Killed
indicate anything besides an OOM kill (or someone explicitly sending a SIGKILL).
Maybe something about the test is causing a large chunk of memory to get allocated which then trips the OOM killer.
I doubled the agent memory, and this happened again nevertheless: https://buildkite.com/materialize/test/builds/88993#01917c05-b42e-4800-b6ca-ef38f1f04136 So out of memory does not make sense to me. I will disable the test for now to unflake main.
Even though we don't see a Killed
this time, we do see exit code 137, which means that someone sent a SIGKILL.
Name: environmentd-0
Namespace: default
Priority: 0
Node: mzcloud-worker5/172.16.16.8
Start Time: Thu, 22 Aug 2024 22:09:23 +0000
Labels: app=environmentd
apps.kubernetes.io/pod-index=0
controller-revision-hash=environmentd-7c8976bf76
statefulset.kubernetes.io/pod-name=environmentd-0
Annotations: <none>
Status: Running
IP: 10.244.3.4
IPs:
IP: 10.244.3.4
Controlled By: StatefulSet/environmentd
Containers:
environmentd:
Container ID: containerd://8fbbcf033b6180e93a37c891de483846eb033349bf91a633856bdc89c7a8836a
Image: materialize/environmentd:mzbuild-RC3RXXTODEO37ECPIWGIZWTX3MSERBA6
Image ID: docker.io/library/import-2024-08-22@sha256:5bb4d346cda2e13e853186d6c0b0a1c1977f6bfcf4b55e7eb0e4bb867f36f0c4
Port: 5432/TCP
Host Port: 0/TCP
Args:
--availability-zone=1
--availability-zone=2
--availability-zone=3
--aws-account-id=123456789000
--aws-external-id-prefix=eb5cb59b-e2fe-41f3-87ca-d2176a495345
--announce-egress-ip=1.2.3.4
--announce-egress-ip=88.77.66.55
--environment-id=cloudtest-test-00000000-0000-0000-0000-000000000000-0
--persist-blob-url=s3://minio:minio123@persist/persist?endpoint=http%3A//minio-service.default%3A9000®ion=minio
--orchestrator=kubernetes
--orchestrator-kubernetes-image-pull-policy=if-not-present
--orchestrator-kubernetes-service-fs-group=999
--persist-consensus-url=postgres://root@cockroach.default:26257?options=--search_path=consensus
--internal-sql-listen-addr=0.0.0.0:6877
--internal-http-listen-addr=0.0.0.0:6878
--unsafe-mode
--clusterd-image
materialize/clusterd:mzbuild-P4H6HNNIR5ZMCJSNJCSLOONTFVMTGGXG
--bootstrap-role
materialize
--internal-persist-pubsub-listen-addr=0.0.0.0:6879
--persist-pubsub-url=http://persist-pubsub
--orchestrator-kubernetes-ephemeral-volume-class=standard
--secrets-controller=kubernetes
--timestamp-oracle-url=postgres://root@cockroach.default:26257?options=--search_path=tsoracle
State: Running
Started: Thu, 22 Aug 2024 22:10:23 +0000
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 22 Aug 2024 22:09:24 +0000
Finished: Thu, 22 Aug 2024 22:10:23 +0000
Ready: True
Restart Count: 1
Environment:
MZ_SOFT_ASSERTIONS: 1
MZ_POD_NAME: environmentd-0 (v1:metadata.name)
AWS_REGION: minio
AWS_ACCESS_KEY_ID: minio
AWS_SECRET_ACCESS_KEY: minio123
MZ_ANNOUNCE_EGRESS_IP: 1.2.3.4,88.77.66.55
MZ_AWS_ACCOUNT_ID: 123456789000
MZ_AWS_EXTERNAL_ID_PREFIX: eb5cb59b-e2fe-41f3-87ca-d2176a495345
MZ_AWS_PRIVATELINK_AVAILABILITY_ZONES: use1-az1,use1-az2
MZ_AWS_CONNECTION_ROLE_ARN: arn:aws:iam::123456789000:role/MaterializeConnection
MZ_SYSTEM_PARAMETER_DEFAULT: enable_unsafe_functions=true;enable_dangerous_functions=true;persist_next_listen_batch_retryer_clamp=16s;persist_next_listen_batch_retryer_initial_backoff=100ms;persist_next_listen_batch_retryer_fixed_sleep=1200ms;persist_enable_arrow_lgalloc_noncc_sizes=true;persist_enable_s3_lgalloc_noncc_sizes=true;allow_real_time_recency=true;cluster_always_use_disk=true;compute_dataflow_max_inflight_bytes=134217728;compute_hydration_concurrency=2;disk_cluster_replicas_default=true;enable_0dt_deployment=false;enable_alter_swap=true;enable_assert_not_null=true;enable_columnation_lgalloc=true;enable_comment=true;enable_compute_chunked_stack=true;enable_connection_validation_syntax=true;enable_copy_to_expr=true;enable_disk_cluster_replicas=true;enable_eager_delta_joins=true;enable_envelope_debezium_in_subscribe=true;enable_expressions_in_limit_syntax=true;enable_introspection_subscribes=true;enable_logical_compaction_window=true;enable_multi_worker_storage_persist_sink=true;enable_mysql_source=true;enable_rbac_checks=true;enable_reduce_mfp_fusion=true;enable_refresh_every_mvs=true;enable_cluster_schedule_refresh=true;enable_sink_doc_on_option=true;enable_statement_lifecycle_logging=true;enable_table_keys=true;enable_variadic_left_join_lowering=true;enable_worker_core_affinity=true;persist_batch_columnar_format=both_v2;persist_batch_columnar_format_percent=10;persist_batch_delete_enabled=true;persist_batch_record_part_format=true;persist_fast_path_limit=1000;persist_inline_writes_single_max_bytes=4096;persist_inline_writes_total_max_bytes=1048576;persist_pubsub_client_enabled=true;persist_pubsub_push_diff_enabled=true;persist_record_compactions=true;persist_roundtrip_spine=true;persist_schema_register=true;persist_schema_require=true;persist_sink_minimum_batch_updates=128;persist_stats_audit_percent=100;persist_txn_tables=lazy;persist_use_critical_since_catalog=true;persist_use_critical_since_snapshot=true;persist_use_critical_since_source=true;persist_part_decode_format=row_with_validate;statement_logging_default_sample_rate=0.01;statement_logging_max_sample_rate=0.01;storage_persist_sink_minimum_batch_updates=100;storage_source_decode_fuel=100000;storage_use_reclock_v2=true;timestamp_oracle=postgres;wait_catalog_consolidation_on_startup=true;with_0dt_deployment_max_wait=100d
MZ_ADAPTER_STASH_URL: postgres://root@cockroach.default:26257?options=--search_path=adapter
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9b2z4 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-9b2z4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: environmentd=true:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 74s default-scheduler Successfully assigned default/environmentd-0 to mzcloud-worker5
Normal Pulled 14s (x2 over 73s) kubelet Container image "materialize/environmentd:mzbuild-RC3RXXTODEO37ECPIWGIZWTX3MSERBA6" already present on machine
Normal Created 14s (x2 over 73s) kubelet Created container environmentd
Normal Started 14s (x2 over 73s) kubelet Started container environmentd
It looks like the other thing that can cause a SIGKILL is Kubernetes attempting to terminate the pod, possibly due to a failed health check. And we do set the termination grace period to 0 (https://github.com/MaterializeInc/materialize/blob/24b180dccbcfb35978f59af5b24842a2eb4d6b19/misc/python/materialize/cloudtest/k8s/environmentd.py#L142), which I think means that Kubernetes will skip sending SIGTERM and jump right to SIGKILL.
That said, we don't have any probes configured for environmentd in cloudtest AFAICT, and there aren't any events about failed health checks or other reasons to kill the envd pod so ... that seems kinda unlikely.
What version of Materialize are you using?
6809a54012c4
What is the issue?
Seen in Cloudtest 7:
When checking the logs environmentd just crashes without a panic!:
Based on CI Failures this was already seen on June 28, so not a recent regression.
Failing test is:
test_missing_secret