Open spacekitteh opened 1 year ago
What's their error msg?
Ok, it seems that there were two issues for ceph-mds:
- /etc/ceph/ceph.mon.keyring and /etc/ceph/ceph.client.admin.keyring needed to have proper permissions set (0444 worked, but it could probably be done better by setting the appropriate users on the services instead).
- kmod needed to be in the path for the services which mount the cephfs filesystems.

However, now that the ceph cluster is up and running, there's still an issue: Nomad isn't working.
The journal is just this, over and over and over:
Aug 04 23:53:12 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:12 example1 nomad[2905]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:12 example1 nomad[2905]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:12 example1 nomad[2905]: ==> Starting Nomad agent...
Aug 04 23:53:12 example1 skyflake-install-cache-gc-start[2911]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:12 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:16 example1 nomad[2905]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:12.718Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:16 example1 nomad[2905]: 2023-08-04T23:53:16.185Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:16 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 62.
Aug 04 23:53:18 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:18 example1 systemd[1]: nomad.service: Consumed 25ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started Nomad.
Aug 04 23:53:18 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 23ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:18 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:18 example1 nomad[2920]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:18 example1 nomad[2920]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:18 example1 nomad[2920]: ==> Starting Nomad agent...
Aug 04 23:53:18 example1 skyflake-install-cache-gc-start[2927]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:18 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:21 example1 nomad[2920]: ==> Error starting agent: server config setup failed: Failed to resolve Serf advertise address ":4648": lookup <nil>: no such host
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.468Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.469Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:18.470Z [ERROR] agent.plugin_loader.docker: failed to list pause containers: plugin_dir=/nix/store/c6z7yz60xqiiw909jrld02lc44ysl7z2-nomad-plugins/bin error=<nil>
Aug 04 23:53:21 example1 nomad[2920]: 2023-08-04T23:53:21.935Z [ERROR] agent: error starting agent: error="server config setup failed: Failed to resolve Serf advertise address \":4648\": lookup <nil>: no such host"
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Aug 04 23:53:21 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 63.
Aug 04 23:53:24 example1 systemd[1]: Stopped Nomad.
Aug 04 23:53:24 example1 systemd[1]: nomad.service: Consumed 29ms CPU time, received 80B IP traffic, sent 120B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started Nomad.
Aug 04 23:53:24 example1 systemd[1]: Stopped skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 22ms CPU time, received 40B IP traffic, sent 60B IP traffic.
Aug 04 23:53:24 example1 systemd[1]: Started skyflake-install-cache-gc.service.
Aug 04 23:53:24 example1 nomad[2935]: ==> WARNING: Number of bootstrap servers should ideally be set to an odd number.
Aug 04 23:53:24 example1 nomad[2935]: ==> Loaded configuration from /etc/nomad.json
Aug 04 23:53:24 example1 nomad[2935]: ==> Starting Nomad agent...
Aug 04 23:53:24 example1 skyflake-install-cache-gc-start[2941]: Error submitting job: Put "http://127.0.0.1:4646/v1/jobs": dial tcp 127.0.0.1:4646: connect: connection refused
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Failed with result 'exit-code'.
Aug 04 23:53:24 example1 systemd[1]: skyflake-install-cache-gc.service: Consumed 17ms CPU time, received 40B IP traffic, sent 60B IP traffic.
I suspect it may be due to having virbr0 set up by libvirtd or podman or something? I'm still trying to figure it out.
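For reference, the two ceph-mds fixes mentioned at the start of this comment (keyring permissions and kmod in the service path) could be written down roughly like this in NixOS. This is only a sketch: it assumes the keyrings are regular files rather than symlinks into the Nix store, and the unit name `mount-cephfs-example` is a made-up placeholder for whichever existing service actually mounts the cephfs filesystems.

```nix
{ pkgs, ... }:
{
  # Assumption: fix up the keyring modes with systemd-tmpfiles.
  # 0444 is simply the mode reported to work above; running the
  # services as the appropriate user would likely be the cleaner fix.
  systemd.tmpfiles.rules = [
    "z /etc/ceph/ceph.mon.keyring 0444 - - -"
    "z /etc/ceph/ceph.client.admin.keyring 0444 - - -"
  ];

  # Hypothetical unit name: replace it with the existing service that
  # mounts cephfs, so that modprobe (from kmod) is on its PATH.
  systemd.services."mount-cephfs-example".path = [ pkgs.kmod ];
}
```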
Could you please share your skyflake.nomad config?
It's just the one in example-server.nix:
nomad = {
  servers = [ "example1" "example2" "example3" ];
  client.meta = {
    example-deployment = "yes";
  };
};
The example VMs use the fec0::/64 range, which is dropped by NixOS 23.05's neat nixos-fw-rpfilter feature.
Quick and dirty workaround: ip6tables -t mangle -F PREROUTING on the host.
Also, /dev/vdc has become /dev/vdb.
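The ip6tables flush above only lasts until the firewall is reloaded. A declarative alternative might be to relax the reverse-path filter in the host's NixOS configuration. The sketch below assumes that the rpfilter check really is what drops the fec0::/64 traffic, and it has not been tested against the example VMs; `networking.firewall.checkReversePath` is the stock NixOS option, and it also accepts `"loose"` if disabling the check entirely is too broad.

```nix
{ ... }:
{
  # Assumption: the nixos-fw-rpfilter chain is the culprit.
  # false disables the reverse-path check; "loose" is a milder setting.
  networking.firewall.checkReversePath = false;
}
```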
None of the ceph-mds daemons manage to successfully start, and so no ceph filesystems can be mounted.