astro / skyflake

NixOS Hyperconverged Infrastructure on Nomad/NixOS
https://astro.github.io/skyflake/
MIT License

Running the example doesn't actually work. #4

Open spacekitteh opened 1 year ago

spacekitteh commented 1 year ago

None of the ceph-mds daemons manage to successfully start, and so no ceph filesystems can be mounted.

astro commented 1 year ago

What's their error msg?

spacekitteh commented 1 year ago

Ok, it seems that there were two issues for ceph-mds:

  1. /etc/ceph/ceph.mon.keyring and /etc/ceph/ceph.client.admin.keyring needed proper permissions (0444 worked, but it would probably be cleaner to set the appropriate users on the services instead).
  2. kmod needed to be in the PATH of the services that mount the cephfs filesystems.
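
For anyone hitting the same two problems, here is a minimal NixOS sketch of how they might be worked around. It is not the project's own fix. The unit name `example-cephfs-mount` is a hypothetical placeholder for whichever service actually mounts cephfs on your node, and the tmpfiles approach assumes the keyrings already exist when systemd-tmpfiles runs.

```nix
{ pkgs, ... }:
{
  # "z" adjusts the mode of an existing path; "-" leaves owner and group untouched.
  # 0444 is the mode reported to work above; tighter ownership would be preferable.
  systemd.tmpfiles.rules = [
    "z /etc/ceph/ceph.mon.keyring 0444 - - -"
    "z /etc/ceph/ceph.client.admin.keyring 0444 - - -"
  ];

  # Hypothetical unit name: replace it with the real cephfs mount service.
  # This merges with the existing unit definition and adds kmod to its PATH.
  systemd.services."example-cephfs-mount".path = [ pkgs.kmod ];
}
```

The `path` option only prepends packages to the unit's PATH, so it should not interfere with whatever else the real service defines.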

However, now that the ceph cluster is up and running, there's still an issue: Nomad isn't working.

The journal is just this, over and over and over:

Aug 04 23:53:12 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:12 example1 nomad[2905]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:12 example1 nomad[2905]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:12 example1 nomad[2905]: ==> Starting Nomad agent...
Aug 04 23:53:12 example1 skyflake-install-cache-gc-start[2911]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:16 example1 nomad[2905]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:12.718Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]:     2023-08-04T23:53:16.185Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 62.
Aug 04 23:53:18 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started Nomad.
Aug 04 23:53:18 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 nomad[2920]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:18 example1 nomad[2920]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:18 example1 nomad[2920]: ==> Starting Nomad agent...
Aug 04 23:53:18 example1 skyflake-install-cache-gc-start[2927]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:21 example1 nomad[2920]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.468Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.469Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]:     2023-08-04T23:53:21.935Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 63.
Aug 04 23:53:24 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started Nomad.
Aug 04 23:53:24 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 nomad[2935]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:24 example1 nomad[2935]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:24 example1 nomad[2935]: ==> Starting Nomad agent...
Aug 04 23:53:24 example1 skyflake-install-cache-gc-start[2941]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 17ms CPU time, received 40B IP traffic, sent 60B IP traffic.

I suspect it may be due to having virbr0 set up by libvirtd or podman or something? I'm still trying to figure it out.

astro commented 1 year ago

Could you please share your skyflake.nomad config?

spacekitteh commented 1 year ago

It's just the one in example-server.nix:

    nomad = {
      servers = [ "example1" "example2" "example3" ];
      client.meta = {
        example-deployment = "yes";
      };
    };

astro commented 1 year ago

The example VMs use the fec0::/64 range, which is dropped by NixOS 23.05's neat nixos-fw-rpfilter feature.

Quick and dirty workaround: run ip6tables -t mangle -F PREROUTING on the host.

Also, /dev/vdc has become /dev/vdb
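
A declarative alternative to flushing the mangle table by hand might be to switch off reverse-path filtering in the host's NixOS configuration. This is only a sketch, under the assumption that the rpfilter rule is the sole thing dropping the fec0::/64 traffic. It has not been verified against the example setup, and it disables the check for all interfaces on the host.

```nix
{
  # Assumption: with the reverse-path filter off, the rpfilter rules are not
  # installed on the host, so traffic from the example VMs is no longer
  # dropped there.
  networking.firewall.checkReversePath = false;
}
```

Unlike the ip6tables command, this setting is part of the configuration and so persists across firewall reloads and reboots.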