dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

v1.4.0 crashes after helm install into kubernetes cluster #1486

roehnan closed this issue 1 year ago

roehnan commented 1 year ago

Describe the bug
Dragonfly DB v1.4.0 was installed in a test Kubernetes cluster via Helm. The created pod crashes with a Check failed: state->channel_sock == socket_fd error in dns_resolve.cc:88.

v1.3.0 did not exhibit this behavior when installed in the same cluster.

To Reproduce
Install v1.4.0 via Helm.

Expected behavior
The Dragonfly DB pod starts successfully.

Environment (please complete the following information):

Reproducible Code Snippet
Dragonfly DB installed via Helm:

helm upgrade -f myvals.yaml -n jbiagio --install dragonfly-messagechannels oci://ghcr.io/dragonflydb/dragonfly/helm/dragonfly --version v1.4.0

Contents of myvals.yaml:

extraArgs:
  - --cluster_mode=emulated
  - --maxmemory=16gb
  - --dbnum=1
  - --logtostderr
  - --cache_mode=true

podSecurityContext:
  fsGroup: 1001

securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1001

podAnnotations:
  prometheus.io/port: "6379"
  prometheus.io/scrape: "true"

Additional context
Here are the logs from the crashing pod:

I20230627 14:56:31.320181     1 init.cc:69] dragonfly running in opt mode.
I20230627 14:56:31.320291     1 dfly_main.cc:618] Starting dragonfly df-v1.4.0-6d4d740d6e2a060cbbbecd987ee438cc6e60de79
I20230627 14:56:31.320462     1 dfly_main.cc:671] Max memory limit is: 16.00GiB
I20230627 14:56:31.353132    24 uring_proactor.cc:158] IORing with 1024 entries, allocated 102720 bytes, cq_entries is 2048
I20230627 14:56:31.425410     1 proactor_pool.cc:147] Running 56 io threads
I20230627 14:56:32.763758     1 server_family.cc:161] Data directory is "/data"
I20230627 14:56:32.774256    11 listener_interface.cc:96] sock[115] AcceptServer - listening on port 6379
F20230627 14:56:32.823841    12 dns_resolve.cc:88] Check failed: state->channel_sock == socket_fd (116 vs. 117) 
*** Check failure stack trace: ***
    @     0x555bf0f49ce3  google::LogMessage::SendToLog()
    @     0x555bf0f424a7  google::LogMessage::Flush()
    @     0x555bf0f43e2f  google::LogMessageFatal::~LogMessageFatal()
    @     0x555bf0a9769f  _ZN4util3fb212_GLOBAL__N_121UpdateSocketsCallbackEPviii.lto_priv.0
    @     0x555bf0d6d84c  ares__send_query
    @     0x555bf0d6e22c  process_answer.part.0
    @     0x555bf0d6e932  read_udp_packets
    @     0x555bf0d6ea62  processfds
    @     0x555bf0a3fd9f  util::fb2::DnsResolve()
    @     0x555bf0aafe55  util::http::Client::Reconnect()
    @     0x555bf0ab0111  util::http::Client::Connect()
    @     0x555bf0a40074  util::http::TlsClient::Connect()
    @     0x555bf0d06d6a  _ZN4dfly12_GLOBAL__N_114VersionMonitor7RunTaskEP10ssl_ctx_st.lto_priv.0
    @     0x555bf0ce3f47  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberENS0_21basic_fixedsize_stackINS0_12stack_traitsEEEZN4util3fb26detail15WorkerFiberImplIZN4dfly12_GLOBAL__N_114VersionMonitor3RunEPNS8_12ProactorPoolEEUlvE_JEEC4IS7_EESt17basic_string_viewIcSt11char_traitsIcEERKNS0_12preallocatedEOT_OSH_EUlOS4_E_EEEEvNS1_10transfer_tE.lto_priv.0
    @     0x555bf0d686df  make_fcontext
*** SIGABRT received at time=1687877792 on cpu 3 ***
PC: @     0x7fda9916400b  (unknown)  raise
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7fda99143941 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1687877792 on cpu 3 ***
PC: @     0x7fda99143941  (unknown)  abort
chakaz commented 1 year ago

Hi @roehnan, thanks for reporting this! I think I know where the problem is. While I work on a fix, may I ask that you try running Dragonfly with an added flag: --version_check=false and see if this is a sufficient workaround until a fix is in place?
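
For a Helm-based install like the one above, the flag would presumably be passed the same way as the other arguments, via extraArgs in myvals.yaml. A minimal sketch, assuming the chart forwards extraArgs verbatim to the dragonfly binary:

# myvals.yaml (sketch): append the suggested workaround flag to the existing list
extraArgs:
  - --cluster_mode=emulated
  - --maxmemory=16gb
  - --dbnum=1
  - --logtostderr
  - --cache_mode=true
  - --version_check=false   # workaround: disables the version check whose DNS lookup hits the crash

followed by re-running the same helm upgrade --install command shown above.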

roehnan commented 1 year ago

Hello @chakaz. I've added the --version_check=false flag and the pod now starts correctly. It looks like using this flag as a workaround will be good for now.

chakaz commented 1 year ago

Fantastic, I'm glad we found a workaround. As I mentioned, I would like to fix it so that a workaround is not needed. May I kindly ask that you run this command inside the container and paste the output here?

grep nameserver /etc/resolv.conf

The reason I ask is that the crash stack shows something unusual happening during DNS resolution, such as multiple sockets being used in parallel for the lookup. I wonder what causes this. Thanks!
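
For a pod deployed by the chart above, one way to run that inside the container is kubectl exec. A sketch, with a hypothetical pod name that would come from listing the pods in the release namespace:

# the pod name below is hypothetical; look it up first
kubectl -n jbiagio get pods
kubectl -n jbiagio exec dragonfly-messagechannels-0 -- grep nameserver /etc/resolv.conf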

roehnan commented 1 year ago
grep nameserver /etc/resolv.conf 
nameserver 172.29.32.2