MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

environmentd crashes when k8s secret is deleted #29072

Open def- opened 1 month ago

def- commented 1 month ago

What version of Materialize are you using?

6809a54012c4

What is the issue?

Seen in Cloudtest 7:

materialize.ui.CommandFailureCausedUIError: Running testdrive --materialize-url=postgres://materialize:materialize@environmentd:6875/materialize --materialize-internal-url=postgres://mz_system@environmentd:6877/materialize --kafka-addr=redpanda:9092 --schema-registry-url=http://redpanda:8081 --default-timeout=300s --log-filter=off --var=replicas=1 --var=single-replica-cluster=quickstart --var=default-storage-size=1 --var=default-replica-size=1 --aws-endpoint=http://minio-service.default:9000 --var=aws-endpoint=http://minio-service.default:9000 --aws-access-key-id=minio --var=aws-access-key-id=minio --aws-secret-access-key=minio123 --var=aws-secret-access-key=minio123 --no-reset in testdrive failed with:
Unable to use a TTY - input is not a terminal or the right kind of file
error: coordinator inconsistency: coordinator

Caused by:
    0: error sending request for url (http://environmentd:6878/api/coordinator/dump): error trying to connect: tcp connect error: Connection refused (os error 111)
    1: error trying to connect: tcp connect error: Connection refused (os error 111)
    2: tcp connect error: Connection refused (os error 111)
    3: Connection refused (os error 111)

Checking the logs, environmentd just crashes without a panic:

default/cluster-u2-replica-u2-gen-0-0[clusterd]: 2024-08-16T09:51:03.978080Z  INFO mz_storage::healthcheck: Health transition for source u5: Some(Running) -> Some(Stalled { error: "postgres: ApiError: secrets \"user-managed-u1\" not found: NotFound (ErrorResponse { status: \"Failure\", message: \"secrets \\\"user-managed-u1\\\" not found\", reason: \"NotFound\", code: 404 }): secrets \"user-managed-u1\" not found: NotFound", hints: {}, namespaced_errors: {Postgres: "ApiError: secrets \"user-managed-u1\" not found: NotFound (ErrorResponse { status: \"Failure\", message: \"secrets \\\"user-managed-u1\\\" not found\", reason: \"NotFound\", code: 404 }): secrets \"user-managed-u1\" not found: NotFound"} })
default/cluster-u2-replica-u2-gen-0-0[clusterd]: 2024-08-16T09:51:03.978152Z  INFO mz_storage::healthcheck: Broadcasting suspend-and-restart command because of Some((Postgres, Stalled { error: "ApiError: secrets \"user-managed-u1\" not found: NotFound (ErrorResponse { status: \"Failure\", message: \"secrets \\\"user-managed-u1\\\" not found\", reason: \"NotFound\", code: 404 }): secrets \"user-managed-u1\" not found: NotFound", hint: None, should_halt: true })) after 5s delay
default/redpanda-bf546879c-ksrwc[redpanda]: INFO  2024-08-16 09:51:05,297 [shard 0:main] cluster - controller_stm.cc:136 - created snapshot at offset 12 in 0 ms
default/redpanda-bf546879c-ksrwc[redpanda]: INFO  2024-08-16 09:51:05,309 [shard 0:main] storage - disk_log_impl.cc:2468 - Removing "/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:0, committed_offset:12, dirty_offset:12}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation=26, reader={/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.log, (3514 bytes)}, writer={no_of_chunks:64, closed:0, fallocation_offset:33554432, stable_offset:3514, flushed_offset:3514, committed_offset:3514, inflight:0, dispatched:0, merged:1, bytes_flush_pending:0}, cache={cache_size=13, dirty tracker: {min: -9223372036854775808, max: -9223372036854775808}}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/redpanda/controller/0_0/0-1-v1.base_index, offsets:0, index:{header_bitflags:0, base_offset:0, max_offset:12, base_timestamp:{timestamp: 1723801805168}, max_timestamp:{timestamp: 1723801825363}, batch_timestamps_are_monotonic:1, with_offset:false, non_data_timestamps:0, broker_timestamp:{{timestamp: 1723801825363}}, num_compactible_records_appended:{12}, index(1,1,1)}, step:32768, needs_persistence:1}})
default/environmentd-0[environmentd]: 2024-08-16T09:51:06.216453Z  WARN mz_server_core: error handling connection: error writing a body to connection: Broken pipe (os error 32) server="internal_http" conn_uuid=<unknown>
default/environmentd-0[environmentd]: /usr/local/bin/entrypoint.sh: line 19:    14 Killed                  environmentd --sql-listen-addr=0.0.0.0:6875 --http-listen-addr=0.0.0.0:6876 "$@"

Based on CI Failures this was already seen on June 28, so not a recent regression.

Failing test is: test_missing_secret
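
For context, the failure happens after the Kubernetes secret backing a user-managed Materialize secret is deleted while a source still references it, roughly this step (a sketch using the Kubernetes Python client, with the secret name and namespace taken from the logs above; the test's actual mechanics may differ):

from kubernetes import client, config

# Delete the k8s secret backing the Materialize secret; the source using it
# then stalls with the 'secrets "user-managed-u1" not found' error above.
config.load_kube_config()
client.CoreV1Api().delete_namespaced_secret(name="user-managed-u1", namespace="default")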

benesch commented 4 weeks ago
default/environmentd-0[environmentd]: /usr/local/bin/entrypoint.sh: line 19:    14 Killed                  environmentd --sql-listen-addr=0.0.0.0:6875 --http-listen-addr=0.0.0.0:6876 "$@"

"Killed" is indicative of a SIGKILL, which generally only occurs on OOM. Is it possible we just ran out of memory during this test?

def- commented 4 weeks ago

It doesn't seem like a test that would consume a lot of memory.

benesch commented 4 weeks ago

The test itself wouldn't, but the infrastructure might—it's Kubernetes, Cockroach, MinIO, a bunch of controllers... that might add up.

def- commented 4 weeks ago

In that case we would also see this crash in other places, so this doesn't fully explain why we see it only in test_missing_secret after deleting the k8s secret.

benesch commented 4 weeks ago

Yeah, I agree that's very suspicious. But I've also never seen Killed indicate anything besides an OOM kill (or someone explicitly sending a SIGKILL).

Maybe something about the test is causing a large chunk of memory to get allocated which then trips the OOM killer.

def- commented 3 weeks ago

I doubled the agent memory, and this happened again nevertheless: https://buildkite.com/materialize/test/builds/88993#01917c05-b42e-4800-b6ca-ef38f1f04136

So out of memory does not make sense to me. I will disable the test for now to unflake main.

benesch commented 3 weeks ago

Even though we don't see a Killed this time, we do see exit code 137, which means that someone sent a SIGKILL.
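
(For reference, exit codes above 128 follow the 128 + signal-number convention, so 137 decodes to signal 9, i.e. SIGKILL:)

import signal

# Exit codes > 128 conventionally mean "terminated by signal (exit_code - 128)".
print(signal.Signals(137 - 128).name)  # SIGKILL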

Name:         environmentd-0
Namespace:    default
Priority:     0
Node:         mzcloud-worker5/172.16.16.8
Start Time:   Thu, 22 Aug 2024 22:09:23 +0000
Labels:       app=environmentd
              apps.kubernetes.io/pod-index=0
              controller-revision-hash=environmentd-7c8976bf76
              statefulset.kubernetes.io/pod-name=environmentd-0
Annotations:  <none>
Status:       Running
IP:           10.244.3.4
IPs:
  IP:           10.244.3.4
Controlled By:  StatefulSet/environmentd
Containers:
  environmentd:
    Container ID:  containerd://8fbbcf033b6180e93a37c891de483846eb033349bf91a633856bdc89c7a8836a
    Image:         materialize/environmentd:mzbuild-RC3RXXTODEO37ECPIWGIZWTX3MSERBA6
    Image ID:      docker.io/library/import-2024-08-22@sha256:5bb4d346cda2e13e853186d6c0b0a1c1977f6bfcf4b55e7eb0e4bb867f36f0c4
    Port:          5432/TCP
    Host Port:     0/TCP
    Args:
      --availability-zone=1
      --availability-zone=2
      --availability-zone=3
      --aws-account-id=123456789000
      --aws-external-id-prefix=eb5cb59b-e2fe-41f3-87ca-d2176a495345
      --announce-egress-ip=1.2.3.4
      --announce-egress-ip=88.77.66.55
      --environment-id=cloudtest-test-00000000-0000-0000-0000-000000000000-0
      --persist-blob-url=s3://minio:minio123@persist/persist?endpoint=http%3A//minio-service.default%3A9000&region=minio
      --orchestrator=kubernetes
      --orchestrator-kubernetes-image-pull-policy=if-not-present
      --orchestrator-kubernetes-service-fs-group=999
      --persist-consensus-url=postgres://root@cockroach.default:26257?options=--search_path=consensus
      --internal-sql-listen-addr=0.0.0.0:6877
      --internal-http-listen-addr=0.0.0.0:6878
      --unsafe-mode
      --clusterd-image
      materialize/clusterd:mzbuild-P4H6HNNIR5ZMCJSNJCSLOONTFVMTGGXG
      --bootstrap-role
      materialize
      --internal-persist-pubsub-listen-addr=0.0.0.0:6879
      --persist-pubsub-url=http://persist-pubsub
      --orchestrator-kubernetes-ephemeral-volume-class=standard
      --secrets-controller=kubernetes
      --timestamp-oracle-url=postgres://root@cockroach.default:26257?options=--search_path=tsoracle
    State:          Running
      Started:      Thu, 22 Aug 2024 22:10:23 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 22 Aug 2024 22:09:24 +0000
      Finished:     Thu, 22 Aug 2024 22:10:23 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      MZ_SOFT_ASSERTIONS:                     1
      MZ_POD_NAME:                            environmentd-0 (v1:metadata.name)
      AWS_REGION:                             minio
      AWS_ACCESS_KEY_ID:                      minio
      AWS_SECRET_ACCESS_KEY:                  minio123
      MZ_ANNOUNCE_EGRESS_IP:                  1.2.3.4,88.77.66.55
      MZ_AWS_ACCOUNT_ID:                      123456789000
      MZ_AWS_EXTERNAL_ID_PREFIX:              eb5cb59b-e2fe-41f3-87ca-d2176a495345
      MZ_AWS_PRIVATELINK_AVAILABILITY_ZONES:  use1-az1,use1-az2
      MZ_AWS_CONNECTION_ROLE_ARN:             arn:aws:iam::123456789000:role/MaterializeConnection
      MZ_SYSTEM_PARAMETER_DEFAULT:            enable_unsafe_functions=true;enable_dangerous_functions=true;persist_next_listen_batch_retryer_clamp=16s;persist_next_listen_batch_retryer_initial_backoff=100ms;persist_next_listen_batch_retryer_fixed_sleep=1200ms;persist_enable_arrow_lgalloc_noncc_sizes=true;persist_enable_s3_lgalloc_noncc_sizes=true;allow_real_time_recency=true;cluster_always_use_disk=true;compute_dataflow_max_inflight_bytes=134217728;compute_hydration_concurrency=2;disk_cluster_replicas_default=true;enable_0dt_deployment=false;enable_alter_swap=true;enable_assert_not_null=true;enable_columnation_lgalloc=true;enable_comment=true;enable_compute_chunked_stack=true;enable_connection_validation_syntax=true;enable_copy_to_expr=true;enable_disk_cluster_replicas=true;enable_eager_delta_joins=true;enable_envelope_debezium_in_subscribe=true;enable_expressions_in_limit_syntax=true;enable_introspection_subscribes=true;enable_logical_compaction_window=true;enable_multi_worker_storage_persist_sink=true;enable_mysql_source=true;enable_rbac_checks=true;enable_reduce_mfp_fusion=true;enable_refresh_every_mvs=true;enable_cluster_schedule_refresh=true;enable_sink_doc_on_option=true;enable_statement_lifecycle_logging=true;enable_table_keys=true;enable_variadic_left_join_lowering=true;enable_worker_core_affinity=true;persist_batch_columnar_format=both_v2;persist_batch_columnar_format_percent=10;persist_batch_delete_enabled=true;persist_batch_record_part_format=true;persist_fast_path_limit=1000;persist_inline_writes_single_max_bytes=4096;persist_inline_writes_total_max_bytes=1048576;persist_pubsub_client_enabled=true;persist_pubsub_push_diff_enabled=true;persist_record_compactions=true;persist_roundtrip_spine=true;persist_schema_register=true;persist_schema_require=true;persist_sink_minimum_batch_updates=128;persist_stats_audit_percent=100;persist_txn_tables=lazy;persist_use_critical_since_catalog=true;persist_use_critical_since_snapshot=true;persist_use_critical_since_source=true;persist_part_decode_format=row_with_validate;statement_logging_default_sample_rate=0.01;statement_logging_max_sample_rate=0.01;storage_persist_sink_minimum_batch_updates=100;storage_source_decode_fuel=100000;storage_use_reclock_v2=true;timestamp_oracle=postgres;wait_catalog_consolidation_on_startup=true;with_0dt_deployment_max_wait=100d
      MZ_ADAPTER_STASH_URL:                   postgres://root@cockroach.default:26257?options=--search_path=adapter
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9b2z4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-9b2z4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 environmentd=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                From               Message
  ----    ------     ----               ----               -------
  Normal  Scheduled  74s                default-scheduler  Successfully assigned default/environmentd-0 to mzcloud-worker5
  Normal  Pulled     14s (x2 over 73s)  kubelet            Container image "materialize/environmentd:mzbuild-RC3RXXTODEO37ECPIWGIZWTX3MSERBA6" already present on machine
  Normal  Created    14s (x2 over 73s)  kubelet            Created container environmentd
  Normal  Started    14s (x2 over 73s)  kubelet            Started container environmentd

It looks like the other thing that can cause a SIGKILL is Kubernetes attempting to terminate the pod, possibly due to a failed health check. And we do set the termination grace period to 0 (https://github.com/MaterializeInc/materialize/blob/24b180dccbcfb35978f59af5b24842a2eb4d6b19/misc/python/materialize/cloudtest/k8s/environmentd.py#L142), which I think means that Kubernetes will skip sending SIGTERM and jump right to SIGKILL.
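
For reference, that setting is terminationGracePeriodSeconds on the pod spec; a minimal sketch with the Kubernetes Python client (not the exact cloudtest code linked above):

from kubernetes import client

# With a zero grace period the kubelet does not wait between SIGTERM and
# SIGKILL, so any pod termination effectively looks like an immediate SIGKILL.
pod_spec = client.V1PodSpec(
    termination_grace_period_seconds=0,
    containers=[client.V1Container(name="environmentd", image="materialize/environmentd")],
)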

benesch commented 3 weeks ago

That said, we don't have any probes configured for environmentd in cloudtest AFAICT, and there aren't any events about failed health checks or other reasons to kill the envd pod, so ... that seems kinda unlikely.
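
For comparison, a probe-triggered restart would require something like this on the container spec, shown here as a purely hypothetical sketch with the Kubernetes Python client (path, port, and thresholds are made up; cloudtest configures none of this):

from kubernetes import client

# Hypothetical liveness probe: after failure_threshold consecutive failures the
# kubelet restarts the container (SIGTERM, then SIGKILL after the grace period).
probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/api/livez", port=6878),  # made-up endpoint
    period_seconds=10,
    failure_threshold=3,
)
container = client.V1Container(name="environmentd", liveness_probe=probe)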