NixOS / nixpkgs

foundationdb coredumps on 24.05 calling into `__cxx1112regex_traits` #319537

Open siriobalmelli opened 3 weeks ago

siriobalmelli commented 3 weeks ago

Describe the bug

foundationdb dumps core in the following system configurations:

| foundationdb | architecture | kernel | nixpkgs | status |
| --- | --- | --- | --- | --- |
| 7.1.32 | aarch64 | 6.1.92 | release-24.05 | coredump |
| 7.1.32 | aarch64 | 6.1.92 | release-23.11 | ok |
| 7.1.32 | aarch64 | 6.6.32 | release-24.05 | coredump |
| 7.1.32 | aarch64 | 6.6.32 | release-23.11 | ok |
| 7.1.30 | aarch64 | - | - | build fail |
| 7.1.32 | x86_64 | 6.1.92 | release-24.05 | coredump |
| 7.1.32 | x86_64 | 6.1.92 | release-23.11 | ok |
| 7.1.32 | x86_64 | 6.6.32 | release-24.05 | coredump |
| 7.1.32 | x86_64 | 6.6.32 | release-23.11 | ok |
| 7.1.30 | x86_64 | 6.1.92 | release-24.05 | coredump |
| 7.1.30 | x86_64 | 6.1.92 | release-23.11 | ok |
| 7.1.30 | x86_64 | 6.6.32 | release-24.05 | coredump |
| 7.1.30 | x86_64 | 6.6.32 | release-23.11 | ok |
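
One way to read the table is to group the observed statuses by each variable in turn. The sketch below is not part of the original report: it transcribes the rows above (leaving out the 7.1.30 aarch64 build failure) and uses a hypothetical helper, `statuses_by`, to show which column separates `coredump` from `ok`.

```python
from collections import defaultdict

# Columns: foundationdb, architecture, kernel, nixpkgs, status.
# Transcribed from the table above; the "build fail" row is omitted.
ROWS = [
    ("7.1.32", "aarch64", "6.1.92", "release-24.05", "coredump"),
    ("7.1.32", "aarch64", "6.1.92", "release-23.11", "ok"),
    ("7.1.32", "aarch64", "6.6.32", "release-24.05", "coredump"),
    ("7.1.32", "aarch64", "6.6.32", "release-23.11", "ok"),
    ("7.1.32", "x86_64", "6.1.92", "release-24.05", "coredump"),
    ("7.1.32", "x86_64", "6.1.92", "release-23.11", "ok"),
    ("7.1.32", "x86_64", "6.6.32", "release-24.05", "coredump"),
    ("7.1.32", "x86_64", "6.6.32", "release-23.11", "ok"),
    ("7.1.30", "x86_64", "6.1.92", "release-24.05", "coredump"),
    ("7.1.30", "x86_64", "6.1.92", "release-23.11", "ok"),
    ("7.1.30", "x86_64", "6.6.32", "release-24.05", "coredump"),
    ("7.1.30", "x86_64", "6.6.32", "release-23.11", "ok"),
]
COLUMNS = ("foundationdb", "architecture", "kernel", "nixpkgs")


def statuses_by(column):
    """Map each value of `column` to the set of statuses observed with it."""
    idx = COLUMNS.index(column)
    seen = defaultdict(set)
    for row in ROWS:
        seen[row[idx]].add(row[-1])
    return dict(seen)


if __name__ == "__main__":
    for column in COLUMNS:
        print(column, statuses_by(column))
```

In this data only the nixpkgs release splits the outcomes cleanly: every foundationdb version, architecture, and kernel appears with both statuses.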

Steps To Reproduce

Set up a single machine test cluster using a minimal flake:

{
  description = "foundationdb crash reproduction";

  inputs = {
    nixpkgs-24_05.url = "github:nixos/nixpkgs/release-24.05";
    nixpkgs-23_11.url = "github:nixos/nixpkgs/release-23.11";
  };

  outputs = {self, ...} @ inputs: let
    inherit (inputs.nixpkgs-24_05.lib) nixosSystem; # toggle nixpkgs here
  in {
    nixosConfigurations.test-system = nixosSystem {
      system = "x86_64-linux"; # toggle architecture here
      modules = [
        ({
          modulesPath,
          pkgs,
          ...
        }: {
          imports = [
            "${modulesPath}/virtualisation/amazon-image.nix"
          ];

          # boot.kernelPackages = pkgs.linuxPackages_6_1;
          boot.kernelPackages = pkgs.linuxPackages_6_6; # toggle kernel here

          ec2.hvm = true;

          networking.useDHCP = true;

          services.foundationdb = {
            enable = true;

            extraReadWritePaths = ["/run/foundationdb"];
            listenAddress = "127.0.0.1:4500";
            listenPortStart = 4500;
            openFirewall = true;
            package = pkgs.foundationdb71;
            pidfile = "/run/foundationdb/fdb.pid";
            publicAddress = "127.0.0.1";
            restartDelay = 120;
            serverProcesses = 1;
            traceFormat = "json";
          };

          system.stateVersion = "24.05";
        })
      ];
    };
  };
}

See the comments above for where to toggle nixpkgs, architecture, and kernel. Changing the foundationdb version is outside the scope of this simple reproduction, but I have tested that as well (see the table above).

Resulting coredump can be seen with:

coredumpctl list | grep fdbserver | tail -n 1 | awk '{ print $5 }' | xargs coredumpctl info

Example:

           PID: 1320 (fdbserver)
           UID: 118 (foundationdb)
           GID: 118 (foundationdb)
        Signal: 11 (SEGV)
     Timestamp: Thu 2024-06-13 09:22:18 UTC (17min ago)
  Command Line: /nix/store/cz1i01ckbvrxn1gli0bbrim16dvznqv7-foundationdb-7.1.32/bin/fdbserver --cluster_file /etc/foundationdb/fdb.cluster --datadir /var/lib/foundationdb/4500 --listen_address 127.0.0.1:4500 --logdir /var/log/foundationdb --logsize 10MiB --maxlogssize 100MiB --memory 8GiB --public_address 127.0.0.1:4500 --storage_memory 1GiB --trace_format json
    Executable: /nix/store/cz1i01ckbvrxn1gli0bbrim16dvznqv7-foundationdb-7.1.32/bin/fdbserver
 Control Group: /system.slice/foundationdb.service
          Unit: foundationdb.service
         Slice: system.slice
       Boot ID: 4b0c405bd88a4031b58c8dceb9be882e
    Machine ID: ec26ef85d6581da22538098e8836259e
      Hostname: ip-172-29-141-193.eu-west-1.compute.internal
       Storage: /var/lib/systemd/coredump/core.fdbserver.118.4b0c405bd88a4031b58c8dceb9be882e.1320.1718270538000000.zst (present)
  Size on Disk: 558.0K
       Message: Process 1320 (fdbserver) of user 118 dumped core.

                Module libgcc_s.so.1 without build-id.
                Module libstdc++.so.6 without build-id.
                Module libboost_context.so.1.78.0 without build-id.
                Stack trace of thread 1320:
                #0  0x0000000002a25854 _ZNKSt7codecvtIDic11__mbstate_tE10do_unshiftERS0_PcS3_RS3_ (fdbserver + 0x2625854)
                #1  0x0000000001d88710 _ZNSt8__detail15_BracketMatcherINSt7__cxx1112regex_traitsIcEELb0ELb0EE8_M_readyEv (fdbserver + 0x1988710)
                #2  0x0000000001d88aac _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE25_M_insert_bracket_matcherILb0ELb0EEEvb (fdbserver + 0x1988aac)
                #3  0x0000000001d9a60d _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE7_M_atomEv (fdbserver + 0x199a60d)
                #4  0x0000000001d99083 _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE14_M_alternativeEv (fdbserver + 0x1999083)
                #5  0x0000000001d9965b _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE14_M_disjunctionEv (fdbserver + 0x199965b)
                #6  0x0000000001d9a443 _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE7_M_atomEv (fdbserver + 0x199a443)
                #7  0x0000000001d99083 _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE14_M_alternativeEv (fdbserver + 0x1999083)
                #8  0x0000000001d99161 _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE14_M_alternativeEv (fdbserver + 0x1999161)
                #9  0x0000000001d9965b _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEE14_M_disjunctionEv (fdbserver + 0x199965b)
                #10 0x00000000023dc723 _ZNSt8__detail9_CompilerINSt7__cxx1112regex_traitsIcEEEC2EPKcS6_RKSt6localeNSt15regex_constants18syntax_option_typeE.constprop.0 (fdbserver + 0x1fdc723)
                #11 0x0000000001d8dbad _ZN8Hostname10isHostnameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (fdbserver + 0x198dbad)
                #12 0x0000000001da2334 _ZN23ClusterConnectionStringC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (fdbserver + 0x19a2334)
                #13 0x0000000001ca267c _ZN21ClusterConnectionFileC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (fdbserver + 0x18a267c)
                #14 0x000000000139faaa _ZN12_GLOBAL__N_110CLIOptions17parseArgsInternalEiPPc (fdbserver + 0xf9faaa)
                #15 0x0000000000e001ca main (fdbserver + 0xa001ca)
                #16 0x00007fbf4e75a10e __libc_start_call_main (libc.so.6 + 0x2a10e)
                #17 0x00007fbf4e75a1c9 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a1c9)
                #18 0x0000000000e520d5 _start (fdbserver + 0xa520d5)
                ELF object binary architecture: AMD x86-64
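
The frames above are mangled Itanium C++ ABI names, in which each plain name component is written as its decimal length followed by its characters. As a reading aid (not part of the original report), the hypothetical helper `split_source_names` below splits such a run. It handles only plain length-prefixed components, not templates or other encodings, so a real demangler such as `c++filt` is the better tool for full symbols.

```python
def split_source_names(mangled: str) -> list[str]:
    """Split a run of <decimal length><identifier> pairs into identifiers.

    Stops at the first character that does not start a length prefix.
    """
    names = []
    i = 0
    while i < len(mangled) and mangled[i].isdigit():
        # Read the decimal length prefix.
        j = i
        while j < len(mangled) and mangled[j].isdigit():
            j += 1
        length = int(mangled[i:j])
        # Take exactly `length` characters as the identifier.
        names.append(mangled[j:j + length])
        i = j + length
    return names


if __name__ == "__main__":
    print(split_source_names("7__cxx1112regex_traits"))  # ['__cxx11', 'regex_traits']
    print(split_source_names("8Hostname10isHostname"))   # ['Hostname', 'isHostname']
```

This is why `7__cxx1112regex_traits` reads as `__cxx11` followed by `regex_traits`, and why frame #11 names `Hostname::isHostname`.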

Expected behavior

A running single-node foundationdb cluster. Check with `sudo fdbcli --exec status`:

Broken System

SIGNAL: Segmentation fault (11)
Trace: addr2line -e fdbcli.debug -p -C -f -i 0x7335ac 0x728a3d 0x72aa33 0x72ab11 0x72b00b 0xc02473 0x84810b 0x6214ea 0x7ff65d43d10e
Segmentation fault

Working System

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1
  Zones                  - 1
  Machines               - 1
  Memory availability    - 7.5 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 06/13/24 10:14:01

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 210 MB

Operating space:
  Storage server         - 3.1 GB free on most full server
  Log server             - 3.1 GB free on most full server

Workload:
  Read rate              - 16 Hz
  Write rate             - 0 Hz
  Transactions started   - 4 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 06/13/24 10:14:01

Additional context

Looking at the dependency tree with:

nix-tree .#nixosConfigurations.test-system.config.services.foundationdb.package

The issue appears to be the glibc version change from glibc-2.38-77 to glibc-2.39-52. glibc is both a direct dependency of foundationdb and an indirect dependency via boost-1.78.0. It was not obvious how to test this further.

I am happy to collect additional data as needed.

Notify maintainers

  1. foundationdb maintainers:

    @thoughtpolice @lostnet

  2. glibc maintainers:

    @eelco @ma27 @connorbaker


siriobalmelli commented 2 weeks ago

Bump.

If there's anything else I can do to better debug please let me know.

lostnet commented 2 weeks ago

It looks to me like the implementation of ClusterConnectionString was replaced in newer versions so not encountering this would probably be a benefit of updating the version, so that may be an option. (But I am not able to participate in that process.)

siriobalmelli commented 1 week ago

> It looks to me like the implementation of ClusterConnectionString was replaced in newer versions so not encountering this would probably be a benefit of updating the version, so that may be an option. (But I am not able to participate in that process.)

@lostnet thank you for your input. Could you tag who you think might be the right person for this? 🙏