NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

apache-kafka connected to local zookeeper often fails to start up #127226

Open antifuchs opened 3 years ago

antifuchs commented 3 years ago

Describe the bug

I'm running apache-kafka connected to a local zookeeper (pretty much the default configuration if you enable both services), and on boot, the apache-kafka service often fails to start, with an error like:

Jun 17 09:37:48 ferdl java[3870]: [2021-06-17 09:37:48,428] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
Jun 17 09:37:48 ferdl java[3870]: [2021-06-17 09:37:48,442] ERROR Error while creating ephemeral at /brokers/ids/1001, node already exists and owner '72063222816702465' does not match current session '72057>
Jun 17 09:37:48 ferdl java[3870]: [2021-06-17 09:37:48,445] ERROR [KafkaServer id=1001] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
Jun 17 09:37:48 ferdl java[3870]: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
Jun 17 09:37:48 ferdl java[3870]:         at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1821)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1759)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1726)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:95)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.server.KafkaServer.startup(KafkaServer.scala:293)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.Kafka$.main(Kafka.scala:82)
Jun 17 09:37:48 ferdl java[3870]:         at kafka.Kafka.main(Kafka.scala)
Jun 17 09:37:48 ferdl java[3870]: [2021-06-17 09:37:48,447] INFO [KafkaServer id=1001] shutting down (kafka.server.KafkaServer)

This seems to indicate that there are broker registrations in the zookeeper directory that weren't properly cleared, as can happen when kafka doesn't shut down cleanly.

To Reproduce

Steps to reproduce the behavior:

  1. Set up kafka and zookeeper:
    services.zookeeper.enable = true;
    services.apache-kafka = {
      enable = true;
      jre = pkgs.jre8_headless;
      extraProperties = ''
        offsets.topic.replication.factor=1
        listeners=PLAINTEXT://localhost:9092
      '';
    };
  2. Build the system and ensure both services are running.
  3. Reboot the machine.
  4. Look at the output of systemctl list-units --failed.

About 75% of the time when I reboot the machine, apache-kafka is listed among the failed units, with the log indicating that it found an old broker ID registration.

Expected behavior

apache-kafka should start up properly every time on boot.

Additional context

I'm pretty sure this is rooted in a missing dependency between the kafka service and the zookeeper service. As the nixos config already has a setting for kafka's zookeeper servers, we could put a requires & after clause in kafka's unit whenever kafka is configured to talk to localhost and zookeeper is also enabled.
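As a rough sketch of what such a requires & after clause could look like in a user's own configuration (not a patch to the module itself): this assumes the systemd units are named apache-kafka.service and zookeeper.service, which is worth verifying with systemctl list-units, and it only checks that zookeeper is enabled, not that kafka actually points at localhost.

```nix
{ config, lib, ... }:
{
  # Assumption: the NixOS modules create units named "apache-kafka.service"
  # and "zookeeper.service"; adjust the names if they differ on your system.
  # Only applied when the local zookeeper service is enabled. This sketch does
  # not inspect kafka's configured zookeeper address.
  systemd.services.apache-kafka = lib.mkIf config.services.zookeeper.enable {
    # Pull zookeeper in whenever kafka is started.
    requires = [ "zookeeper.service" ];
    # Start kafka only after zookeeper has been started.
    after = [ "zookeeper.service" ];
  };
}
```

Since systemd stops units in the reverse of their start-up ordering, an after ordering like this should also cause kafka to be stopped before zookeeper on shutdown.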

Notify maintainers @ragnard @srhb

Metadata

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
- services.apache-kafka

antifuchs commented 3 years ago

To clarify, I don't think this is an issue with the startup of the two units so much as with their shutdown. If both units get stopped during a reboot, my suspicion is that zookeeper stops before kafka can deregister itself, leaving old registrations sitting around. Those then prevent a successful startup until zookeeper has had a chance to clean out old sessions (10min into bootup or whatever the session timeout is).

roberth commented 2 years ago

FWIW, kafka is moving away from zookeeper and will maintain its metadata by its own means instead.

srhb commented 1 year ago

Re. KRaft, I have put up a PR for a (hopefully) unopinionated way of achieving this. I don't think I like making assumptions about where the controller lives, whether it's Zookeeper or KRaft. Colocation is incidental from the POV of NixOS, I think. Not that it's a huge deal to add a systemd ordering in the module, but it's also not a huge deal to do so on the user end.

Regardless, PTAL at #203987 and the PR #224611 -- maybe I am moving in the wrong direction by making the module less smart, but I personally think it makes us (as in nixpkgs/NixOS) more flexible.