SynologyOpenSource / synology-csi

Apache License 2.0

Compatibility with Nomad #14

Closed johnnyplaydrums closed 2 years ago

johnnyplaydrums commented 2 years ago

Hello! I was wondering if synology-csi works with Nomad? At first glance it would appear there is only support for Kubernetes, but I just wanted to double check. Thank you

ressu commented 2 years ago

considering that Nomad supports standard CSI interfaces and claims that the Kubernetes CSI plugins work out of the box, I don't see any reason why it wouldn't work. The bigger question is how to configure the plugin to work as intended.
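
For context, Nomad registers a CSI plugin by running it as an ordinary task with a csi_plugin block rather than through kubelet. A minimal sketch of that Nomad-side wiring (the image tag, id, and mount_dir values are illustrative assumptions, not settings confirmed at this point in the thread):

task "plugin" {
  driver = "docker"
  config {
    image = "synology/synology-csi:v1.1.2"  # illustrative tag
  }
  csi_plugin {
    id        = "synology"  # plugin_id that volumes will later reference
    type      = "monolith"  # could also be split into "controller" and "node"
    mount_dir = "/csi"      # directory where Nomad expects the plugin's socket
  }
}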

johnnyplaydrums commented 2 years ago

Hi @ressu, thank you for the response. I agree that the various documentation claims that synology-csi should work with Nomad, but I'm not sure whether that is true in practice. Have you heard of anyone successfully using synology-csi with Nomad? I'd love to learn from their experience. When I tried to deploy this synology-csi container into Nomad, I got the error:

[FATAL] [driver/grpc.go:91] Failed to listen: listen unix //var/lib/kubelet/plugins/csi.san.synology.com/csi.sock: bind: no such file or directory

As you can see, that's pointing to a Kubernetes-specific path, /var/lib/kubelet/. It seems that this plugin has a lot of hard-coded Kubernetes configuration that doesn't look like it can be overridden; for example, the csiEndpoint mentioned above has kubelet hard-coded into it: https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/main.go#L23

I am new to CSI plugins, so maybe I just need to learn more about them in order to properly configure synology-csi to work with Nomad. But from what I can tell, it doesn't seem like it will work. What do you think?

ressu commented 2 years ago

Unfortunately I don't know how Nomad invokes the CSI daemons, which would give me a better idea of how to solve this. The path for the CSI socket can be overridden with the -e or --endpoint flag, as seen here: https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/main.go#L98

Many of the default features are mainly handled by the generic CSI containers, so that would change the situation a bit too.

That being said, if you can adjust the startup of the CSI plugin and add --endpoint=/csi/csi.sock, you might be able to get something going.
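
In Nomad terms that would mean passing the flag through the plugin task's args so the socket is created inside the directory exposed by csi_plugin's mount_dir. A hedged fragment, assuming mount_dir = "/csi" (the unix:// scheme is what ended up working later in this thread):

config {
  args = [
    "--endpoint",
    "unix:///csi/csi.sock",  # socket path under the assumed mount_dir = "/csi"
  ]
}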

johnnyplaydrums commented 2 years ago

Awesome, that's super helpful @ressu! I'll close this for now. If anyone else has advice on properly configuring this plugin to work with Nomad, please reach out :)

johnnyplaydrums commented 2 years ago

Hey @ressu, another question for you. I was able to make some progress and get the synology-csi plugin running in Nomad by setting --endpoint=unix:///csi/csi.sock. I've registered the Synology volume and was trying to deploy a job using that volume when I got the following error:

2022-01-05T21:14:44Z [INFO] [driver/utils.go:104] GRPC call: /csi.v1.Node/NodeStageVolume
2022-01-05T21:14:44Z [INFO] [driver/utils.go:105] GRPC request: {"staging_target_path":"/csi/staging/scada-test/ro-file-system-single-node-reader-only","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["noatime"]}},"access_mode":{"mode":2}},"volume_id":"1"}
2022-01-05T21:14:44Z [ERROR] [driver/utils.go:108] GRPC error: rpc error: code = Internal desc = rpc error: code = NotFound desc = Volume[1] is not found

It's unable to find the volume on our Synology DSM: Volume[1] is not found. I also tried registering the volume as "Volume 1", "/volume1", and combinations like that, but no luck. Our Synology device just has 1 volume called Volume 1 in the DSM dashboard. I'm not sure what the volume_id is supposed to be. Do you know how I can find what the volume_id is for our Synology volume?

ressu commented 2 years ago

I checked my logs and it seems that the volume_id is the UUID of the LUN:

2021-12-23T23:18:29Z [INFO] [driver/utils.go:104] GRPC call: /csi.v1.Node/NodeStageVolume
2021-12-23T23:18:29Z [INFO] [driver/utils.go:105] GRPC request: {"staging_target_path":"/var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pod-config/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"dsm":"10.5.1.2"},"volume_id":"c189b0b4-5bbd-40d6-b1b8-bb8645218402"}

I think the DSM in context is also required so that the CSI knows which DSM to contact, but I'm not certain.

Also, make sure that your volumes have an appropriate prefix as defined in https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/pkg/models/dsm.go#L19-L21
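
Translated into a Nomad volume registration, that would mean putting the LUN's UUID into external_id and the DSM address into a context block. A rough sketch under those assumptions (all values are placeholders; the exact placement of access_mode/attachment_mode varies by Nomad version, as discussed later in the thread):

id          = "example"
name        = "example"
type        = "csi"
plugin_id   = "synology"
external_id = "<UUID of the LUN>"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

context {
  dsm = "<DSM IP address>"
}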

johnnyplaydrums commented 2 years ago

Thanks @ressu. Apologies for my ignorance but how would I find the UUID of the LUN?

ressu commented 2 years ago

Oh, right! It's not immediately visible. I think I found the UUID by inspecting the source code in the admin UI. I had to dig around there when I migrated my old volumes. The normal volume creation automation in Kubernetes will figure it out automatically for volumes created by the CSI, but for externally created volumes the pain of searching the HTML in the admin console was real :(

johnnyplaydrums commented 2 years ago

Oh boy that is a bit hacky. Ok I'll look around. Do you remember what part of the UI you were able to find it in?

ressu commented 2 years ago

Yeah, it's very hacky. I preferred the third-party CSI mechanism over this one, but you use what you can :smile:

The way you can find the UUID of the volume is in SAN Manager: if you list the LUNs there and look at the page source, the itemid attribute lists the UUID of each volume. It's a bit of a pain to list the volumes, though, since the UI keeps refreshing the HTML, so depending on your browser you might need to set a breakpoint somewhere to see the actual contents.

The way I got it was in Chrome using the inspector: I created a breakpoint on the HTML element's subtree modifications, which froze things for long enough to properly copy the UUID from the list.

johnnyplaydrums commented 2 years ago

Awesome! I was able to find the UUID via the itemid in the source code on the SAN manager page. I hadn't yet created a LUN, so I first did that, and then grabbed the UUID. I am still getting the same error Volume[763336ca-0f20-4fcf-8e8d-3406168c60fc] is not found but I'm wondering if it's related to your other comment above:

Also, make sure that your volumes have an appropriate prefix as defined in

https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/pkg/models/dsm.go#L19-L21

The volume prefix - is that something I configure on the DSM side or the Nomad side? I don't see a prefix option in the LUN configuration:

(screenshot: Screen Shot 2022-01-05 at 3 06 06 PM)

fwiw the IqnPrefix does appear to be correct:

(screenshot: Screen Shot 2022-01-05 at 3 05 15 PM)

ressu commented 2 years ago

The prefix goes into the LUN name. I think you also need to create an iSCSI host with the same prefix. Mine are in the form of k8s-csi-<kubernetes volume name>. So the suffix in the name doesn't matter as long as it starts with k8s-csi.

chihyuwu commented 2 years ago

The synology-csi only looks for LUNs with the prefix "k8s-csi". Try to change the LUN name from "LUN-1" to "k8s-csi-LUN-1".

johnnyplaydrums commented 2 years ago

Sweet, thanks to both of you! synology-csi was able to successfully find the volume after using the UUID and creating a LUN and iSCSI host with the name k8s-csi-LUN-1. Now on to the next error which is:

2022-01-06T18:59:06Z [ERROR] [driver/initiator.go:37] Failed to run iscsiadm session: exit status 1
2022-01-06T18:59:06Z [ERROR] [driver/initiator.go:114] Failed in discovery of the target: Couldn't find hostPath: /host in the CSI container (exit status 1)
2022-01-06T18:59:06Z [ERROR] [driver/utils.go:108] GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = Failed to login with target iqn [iqn.2000-01.com.synology:RackStationNYHQ.Target-1.72e4481bb23], err: Couldn't find hostPath: /host in the CSI container (exit status 1)

The chroot.sh script indicates that the /host directory needs to be available inside the container, is that right? Can you elaborate on what's needed here? https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/chroot/chroot.sh#L4

johnnyplaydrums commented 2 years ago

fwiw I tried exec-ing into the running synology-csi containers and running mkdir /host just to see if making that directory available helped, but synology-csi now says

err: chroot: can't execute '/usr/bin/env': No such file or directory

ressu commented 2 years ago

The /host directory is a bind mount of the filesystem from the node (the machine which is doing the mounting).

Relevant Kubernetes configurations are https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/deploy/kubernetes/v1.19/node.yml#L112-L113 and https://github.com/SynologyOpenSource/synology-csi/blob/dc05a795b79b911ec5882c3c837a7779cf3576a8/deploy/kubernetes/v1.19/node.yml#L132-L135

I don't know how the containers are configured for Nomad, but effectively you need to mount / into the container as the directory /host.
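
With Nomad's Docker driver, the equivalent of those Kubernetes hostPath mounts would be a bind mount in the plugin task's config, roughly like this sketch (not a verified configuration; privileged mode is assumed because the plugin shells out to iscsiadm):

config {
  privileged = true
  mount {
    type     = "bind"
    source   = "/"      # the node's root filesystem
    target   = "/host"  # the path chroot.sh expects
    readonly = false
  }
}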

johnnyplaydrums commented 2 years ago

OK, makes sense. I need to make a Nomad client config change to make the host's root filesystem available, so I will work on that and then try deploying the job again. My instinct makes me nervous about mounting the entire host filesystem inside the container. Would you mind describing why this is needed?

Thanks again @ressu for helping here! I'm hoping that all this information and troubleshooting will be useful to other folks who try to use synology-csi with Nomad.

ressu commented 2 years ago

My instinct makes me nervous about mounting the entire host filesystem inside the container. Would you mind describing why this is needed?

Trust me, you're not alone with this one. I wanted to work around the mount in other CSIs, but couldn't find a reliable way :laughing:

As far as I understand, the host mount allows the CSI to act as the host system while the container sandbox is in place. It's a cheap trick used quite often to reduce complexity of the code whenever there are too many dependencies to the host system. I've seen the same pattern being used in other CSIs and CNIs.

Thanks again @ressu for helping here! I'm hoping that all this information and troubleshooting will be useful to other folks who try to use synology-csi with Nomad.

Happy to help, I'm mostly stabbing in the dark since I've never run Nomad myself. But I'm just happy that you are able to make progress with the hints I'm able to give you.

johnnyplaydrums commented 2 years ago

I need to take a pause on this synology-csi <> Nomad work and will hopefully come back to it at a later date. For now I will close out this issue since all my open questions have been answered. I'll reopen this issue if/when I come back to it and new questions arise. If anyone finds this issue in the future and wants to know how my nomad job.hcl, volume.hcl, and related configuration ended up, please reach out. Thanks again for all the help!

mabunixda commented 2 years ago

The creation of a storage does work, but there is still some problem with the access mode within nomad:

$ nomad volume status
Container Storage Interface
ID    Name  Plugin ID  Schedulable  Access Mode
test  test  synology   true         <none>

My current configuration for the Nomad CSI plugin job is like this:

job "plugin-synology" {
  type = "system"
  group "controller" {
    task "plugin" {
      driver = "docker"
      config {
        image = "docker.io/synology/synology-csi:v1.0.0"
        privileged = true
        volumes = [
          "local/csi.yaml:/etc/csi.yaml",
          "/:/host",
        ]
        args = [
          "--endpoint",
          "unix://csi/csi.sock",
          "--client-info",
          "/etc/csi.yaml",
        ]
      }
      template {
          destination = "local/csi.yaml"
          data = <<EOF
---
clients:
- host: 192.168.1.2
  port: 8443
  https: true
  username: nomad
  password: <password>
EOF
      }
      csi_plugin {
        id        = "synology"
        type      = "monolith"
        mount_dir = "/csi"
      }
      resources {
        cpu    = 256
        memory = 256
      }
    }
  }
}

and the volume definition for nomad volume create is like this:

id        = "test"
name      = "test"
type      = "csi"
plugin_id = "synology"

capacity_min = "1GiB"
capacity_max = "2GiB"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  mount_flags = ["rw"]
}

johnnyplaydrums commented 2 years ago

Hi @mabunixda, I think I ran into a similar issue. I used the nomad volume register command instead of create. I got this error even when I had the access_mode defined within the capability, as you do: Error registering volume: Unexpected response code: 500 (rpc error: validation: missing access mode, missing attachment mode).

Interestingly, when I moved the access_mode and attachment_mode to the top level, outside the capability block, the nomad volume register command worked and the volume had the correct access mode. According to the docs, that's not how it should work, but maybe it's a mistake in the docs or it's changed in more recent versions of Nomad (I'm on 1.0.4). Here's my volume.hcl:

id              = "test"
name            = "test"
type            = "csi"
external_id     = "a53b447a-c52b-48e5-9810-943e3b527a68"
plugin_id       = "synology"
access_mode     = "single-node-reader-only"
attachment_mode = "file-system"

mount_options {
  fs_type     = "btrfs"
  mount_flags = ["noatime"]
}

context {
  dsm = "<dsm-ip>"
}

I didn't try the nomad volume create command. Does it actually create the volume in Synology? If so, that's better, because then I don't have to go hunting for the UUID for the external_id field.

mabunixda commented 2 years ago

@johnnyplaydrums Yes, the create actually creates a volume on my Synology, but it does not become usable in Nomad.

johnnyplaydrums commented 2 years ago

@mabunixda does putting access_mode and attachment_mode outside the capability solve that issue for you?

mabunixda commented 2 years ago

No, because that is not valid syntax for Nomad > 1.1.0.

johnnyplaydrums commented 2 years ago

Ah I see ☹️

taveraluis commented 2 years ago

Hello everyone, I am quite interested in this thread and will be hitting this wall soon (I have not set up Nomad on this new setup yet). I hope we can make this work together at some point :)

AndrewCooper commented 2 years ago

I started working on this also. I took the route of following the Stateful Workloads tutorial and copying what made sense from the synology-csi configs. In my fork, the Nomad stuff is in deploy/nomad/v1.2.5.

I've been able to get a controller and node going. Both appear to be running, connect to DSM, and no errors showing in the docker logs.

I can create volumes, and these show up in SAN Manager in DSM with what appear to be the correct settings. In Nomad they also show as Schedulable, but the Access Mode for all volumes is <none> no matter what I use as the access_mode at creation. If I try to use one of these volumes as a mount, the run fails, I assume because the volume doesn't appear to have the same access and attachment modes as the job.

    2022-02-03T21:18:51-06:00: Task Group "mysql-server" (failed to place 1 allocation):
      * Constraint "missing CSI Volume test2[0]": 1 nodes excluded by filter

At this point I suspect there's a miscommunication between Nomad and the CSI plugin when fetching the capabilities of a volume, but I'm not sure how to test it. I have set log-level=debug for the controller and node, which does print a lot of data. I'm not sure how to get something similar on the Nomad side; debug-level logging in Nomad doesn't seem to show any of the actual communication with the plugin.

travisghansen commented 2 years ago

This driver implements csi but does so with k8s-isms as you have discovered. I have a pure csi based driver that works with synology (and nomad) available here: https://github.com/democratic-csi/democratic-csi

matthiasschoger commented 1 year ago

My instinct makes me nervous about mounting the entire host filesystem inside the container. Would you mind describing why this is needed?

Trust me, you're not alone with this one. I wanted to work around the mount in other CSIs, but couldn't find a reliable way 😆

As far as I understand, the host mount allows the CSI to act as the host system while the container sandbox is in place. It's a cheap trick used quite often to reduce complexity of the code whenever there are too many dependencies to the host system. I've seen the same pattern being used in other CSIs and CNIs.

Thanks again @ressu for helping here! I'm hoping that all this information and troubleshooting will be useful to other folks who try to use synology-csi with Nomad.

Happy to help, I'm mostly stabbing in the dark since I've never run Nomad myself. But I'm just happy that you are able to make progress with the hints I'm able to give you.

Necro'ing this thread since I'm also banging my head against a wall trying to get synology-csi to work on Nomad + my Synology DS220+.

I got Nomad running on my DS220+ with the latest DSM, but when I try to deploy the synology-csi I'm getting the following error messages in the systemd journal:

Mar 26 13:52:49 storage nomad[17771]: 2023-03-26T13:52:49.687+0200 [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin path=/volume1/homelab/nomad/var/lib/nomad/alloc/4969158d-6045-297a-a770-89b47d94e21f/synology-csi-plugin/secrets/api.sock error="listen unix /volume1/homelab/nomad/var/lib/nomad/alloc/4969158d-6045-297a-a770-89b47d94e21f/synology-csi-plugin/secrets/api.sock: bind: invalid argument"
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.634+0200 [ERROR] client.alloc_runner.task_runner.task_hook: killing task because plugin failed: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin error="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory"
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.634+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin type="Plugin became unhealthy" msg="Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory" failed=false
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.886+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin type=Killing msg="CSI plugin did not become healthy before configured 30s health timeout" failed=true
Mar 26 13:53:47 storage nomad[17771]: 2023-03-26T13:53:47.890+0200 [ERROR] client.alloc_runner.task_runner.task_hook: failed to kill task: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin kill_reason="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory" error="context canceled"

Nomad is running as root, therefore I think we can exclude permission issues.

Any idea what might cause these issues? The first line with "bind: invalid argument" looks like the culprit to me. I suspect that the Linux version DSM is based on is too old, but some confirmation or a recommendation on how to fix this would be nice.

awanaut commented 1 year ago

I was able to successfully get this working with Nomad 1.6.x using iSCSI. As stated above, the SMB portion relies on a k8s secret, which Nomad cannot consume. Here is a working example of a CSI driver job, a volume spec, and a job file consuming that volume, for anyone who might need them. I'll submit a PR for docs at some point.

One thing that may seem obvious, but that I can't find listed anywhere, is the need for the following packages to be installed on the host running Nomad:

I have not tested exactly which of them are required; I found the list on a blog, installed them all, and it works.

CSI Driver job - synology-csi.nomad.hcl

Using monolith here. There's not really a need to break it out for homelab use, which I assume is what most people are using Synologys for.

Run nomad job run synology-csi.nomad.hcl

job "synology-csi" {
  datacenters = ["dc1"]
  type        = "system"
  node_pool = "default"

  group "controller" {

    task "plugin" {
      driver = "docker"
      config {
        image        = "synology/synology-csi:v1.1.2"
        privileged   = true
        network_mode = "host"
        mount {
          type     = "bind"
          source   = "/"
          target   = "/host"
          readonly = false
        }
        mount {
          type     = "bind"
          source   = "local/csi.yaml"
          target   = "/etc/csi.yaml"
          readonly = true
        }

        args = [
          "--endpoint",
          "unix://csi/csi.sock",
          "--client-info",
          "/etc/csi.yaml"
        ]
      }
      template {
        data        = <<EOH
---
clients:
  - host: <ip of synology host>
    port: 5000
    https: false
    username: <username with admin privileges>
    password: <password>
EOH
      destination = "local/csi.yaml"
      }
      csi_plugin {
        id        = "synology"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Volume spec - example-volume.nomad.hcl

Run: nomad volume create example-volume.nomad.hcl

id        = "example"
name      = "example"
type      = "csi"
plugin_id = "synology"

capacity_min = "1GiB"
capacity_max = "2GiB"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

#mount/fstab options https://linux.die.net/man/8/mount
mount_options {
  fs_type     = "btrfs"
  mount_flags = ["noatime"] 
}

#if you have multiple storage pools and/or volumes, specify where to mount the container volume/LUN or else it'll just pick one for you
parameters {
  location = "/volume2" 
}

Validate that it's created and healthy by running nomad volume status

Example App Job - synology-csi-example.nomad.hcl

job "synology-csi-example" {  # job name assumed here to match the file name above
  datacenters = ["dc1"]
  node_pool   = "default"
  group "web" {
    count = 1
    volume "example_volume" {
      type            = "csi"
      read_only       = false
      source          = "example"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }
    network {
      port "http" {
        static = 8888
        to     = 80
      }
    }

    task "nginx" {
      driver = "docker"
      volume_mount {
        volume      = "example_volume"
        destination = "/config"
        read_only   = false
      }
      config {
        image = "nginxdemos/hello:latest"
        ports = ["http"]
      }
    }
  }
}

Run nomad job run synology-csi-example.nomad.hcl ... Profit!

gjrtimmer commented 11 months ago

@awanaut thank you for this information, I'm trying to set it up myself. Can you share some info regarding the config on the Synology? Is it still required to create a LUN with the k8s-csi prefix as mentioned above?

awanaut commented 11 months ago

@awanaut thank you for this information, I'm trying to set it up myself. Can you share some info regarding the config on the Synology? Is it still required to create a LUN with the k8s-csi prefix as mentioned above?

Nope! You just need to make sure the Synology volume is specified under the "parameters" stanza. nomad volume create will create the LUN on the backend. If you have a LUN already created and you want to use that, you'd use nomad volume register; however, some of the config file parameters are different. Check here: https://developer.hashicorp.com/nomad/docs/commands/volume/register.

s4v4g3 commented 10 months ago

@awanaut Thanks a ton for figuring all this out. I was able to get it set up and working on my cluster.

I ran into one minor issue that I'm wondering if others have seen. When mounting an iSCSI volume into a task, the mount point is owned by root (uid/gid=0), with permissions of 755. This causes some apps, such as postgres, to fail since they run as a non-root user and try to chown their data directory on startup.

I got around this by creating a sidecar pre-start task that fixed the permissions on the volume before the main task runs, but I'm wondering if there's a better/cleaner way. I've experimented with a few settings without much luck.
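
For anyone wanting to replicate that workaround, a prestart task along these lines could chown the mount before the main task starts. This is only a sketch of the approach described above, not s4v4g3's exact config; the image, uid/gid, and names are assumptions:

group "db" {
  volume "data" {
    type            = "csi"
    source          = "example"
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  # Runs to completion before the main task, fixing ownership of the mount.
  task "fix-perms" {
    driver = "docker"
    lifecycle {
      hook    = "prestart"
      sidecar = false
    }
    volume_mount {
      volume      = "data"
      destination = "/data"
    }
    config {
      image   = "busybox:1.36"
      command = "chown"
      args    = ["-R", "999:999", "/data"]  # 999 is the postgres uid/gid in the official image
    }
  }

  task "postgres" {
    driver = "docker"
    volume_mount {
      volume      = "data"
      destination = "/var/lib/postgresql/data"
    }
    config {
      image = "postgres:15"
    }
    env {
      POSTGRES_PASSWORD = "example"
    }
  }
}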

gjrtimmer commented 9 months ago

@awanaut @s4v4g3 do snapshots work in combination with Nomad?

gjrtimmer commented 9 months ago

@s4v4g3 Might this be related to the createTargetMountPath func, which creates the mount folder and sets the permissions to 0750? If this is what causes the issue, where could we place a config item to change this so it's not hardcoded anymore? Any thoughts?

awanaut commented 9 months ago

@awanaut @s4v4g3 do snapshots work in combination with Nomad?

I have not tested CSI snapshots to see if they just use Synology's snapshots. I imagine they do.

gjrtimmer commented 9 months ago

@awanaut, any suggestions on how to define the Nomad job? I'm struggling to convert the Kubernetes spec to Nomad for snapshots.