aerospike / aerospike-kubernetes-operator

Kubernetes operator for the Aerospike database
https://docs.aerospike.com/cloud/kubernetes/operator
Apache License 2.0

not an Aerospike device but not erased #303

Closed mateusmuller closed 3 months ago

mateusmuller commented 3 months ago

Hello!

I'm running an AerospikeCluster via Aerospike Operator on top of AWS EKS. For a specific Aerospike namespace, we're using RAW devices to store data.

This has been tested with both EBS and Instance Store volumes. Both throw the same error after the init container initialization:

Jul 08 2024 14:40:41 GMT: INFO (drv_ssd): (drv_ssd.c:3514) opened device /dev/xvdf: usable size 74994155520, io-min-size 512
Jul 08 2024 14:40:41 GMT: INFO (drv_ssd): (drv_ssd.c:1067) /dev/xvdf has 8940 wblocks of size 8388608
Jul 08 2024 14:40:41 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2444) /dev/xvdf: not an Aerospike device but not erased - check config or erase device
Jul 08 2024 14:40:41 GMT: WARNING (as): (signal.c:259) SIGUSR1 received, aborting Aerospike Enterprise Edition build 7.1.0.0 os ubuntu22.04 arch x86_64 sha 719892f ee-sha 53fd619
Jul 08 2024 14:40:41 GMT: WARNING (as): (signal.c:293) si_code SI_TKILL (-6)

The pod keeps restarting on CrashLoopBackOff state.

I understand we have this documentation, but I believe this has to be automated somehow.

I tried adding blkdiscard commands to userdata, but it doesn't work. It only works if I add a second init container via the AerospikeCluster kind:

      initContainers:
        # custom container to wipe header 8MiB
        # ref:
        # https://aerospike.com/docs/server/operations/configure/storage/ssd_init
        - name: aerospike-init-custom
          image: aerospike/aerospike-kubernetes-init:2.2.1
          command:
            - sh
            - -c
          args:
            - |
              test ! -f /mnt/tmp/blkdiscard_executed && {
                blkdiscard -v -z --length 8MiB /dev/xvdf
                touch /mnt/tmp/blkdiscard_executed
              } || true
          securityContext:
            privileged: true

When this blkdiscard command is executed, the pod can start properly. I'm pretty sure I shouldn't have to do this to get Aerospike working.

Do you have any recommendations? Shouldn't this initialization be handled by your init container? Or maybe I'm missing something?

Thank you.

abhishekdwivedi3060 commented 3 months ago

Hi @mateusmuller, The option you are looking for is initMethod in the AerospikeCluster CR. It initialises the volumes attached to the CR. For more info, refer to: https://aerospike.com/docs/cloud/kubernetes/operator/Cluster-configuration-settings#blockfilesystem-volume-policy

Eg:

spec:
  size: 3 
  storage:
    filesystemVolumePolicy:
      initMethod: deleteFiles
      cascadeDelete: true
    blockVolumePolicy:
      cascadeDelete: true
      initMethod: dd
mateusmuller commented 3 months ago

Hey @abhishekdwivedi3060! Thank you.

I tried blkdiscard, but it doesn't seem to initialize the device itself, and dd takes too much time.

Should I use dd in this case?

abhishekdwivedi3060 commented 3 months ago

blkdiscard only works for devices that support TRIM. Refer to the NOTE/CAUTION section in the initMethod link shared above.

Use dd if blkdiscard is not working for you.
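For intuition, the trade-off between the two approaches can be sketched on a scratch file standing in for the raw device. This is illustrative only, not AKO's implementation; on a real node the equivalent operations run against /dev/xvdf as root:

```shell
#!/bin/sh
# Illustrative sketch: contrast zeroing a whole "device" (what
# initMethod: dd effectively does) with wiping only the first 8 MiB
# header region (the manual ssd_init approach from this thread).
# A 64 MiB scratch file stands in for /dev/xvdf.
DEV=$(mktemp)
dd if=/dev/urandom of="$DEV" bs=1M count=64 status=none   # stale data

# Full zero-fill: always safe, but runtime is O(device size) --
# hours on multi-terabyte disks.
dd if=/dev/zero of="$DEV" bs=1M count=64 conv=notrunc status=none

# Header-only wipe: O(8 MiB), near-instant regardless of device size.
dd if=/dev/urandom of="$DEV" bs=1M count=64 conv=notrunc status=none  # re-dirty
dd if=/dev/zero of="$DEV" bs=1M count=8 conv=notrunc status=none

# The first 8 MiB must now be all NUL bytes.
NONZERO=$(head -c 8388608 "$DEV" | tr -d '\0' | wc -c)
[ "$NONZERO" -eq 0 ] && echo "header zeroed"
rm -f "$DEV"
```

On a real block device the header-only wipe is `blkdiscard -z --length 8MiB /dev/xvdf` (as in the custom init container above), where `-z` writes zeroes instead of issuing a TRIM.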

mateusmuller commented 3 months ago

It's infeasible to wait this long for dd to run, at least on top of k8s:

└─[$] <> kgpw
NAME                     READY   STATUS     RESTARTS   AGE
aerospike-identity-1-0   0/2     Init:0/1   0          8m28s
aerospike-identity-1-1   0/2     Init:0/1   0          8m28s
aerospike-identity-1-2   0/2     Init:0/1   0          8m28s

If I change to blkdiscard, then it throws:

2024-07-22T14:31:25Z    INFO    init-setup  Starting initialisation for volume={podName:aerospike-identity-1-0 volumeMode:Block volumeName:nvme-ssd effectiveWipeMethod:dd effectiveInitMethod:blkdiscard aerospikeVolumePath:/dev/xvdf}
2024-07-22T14:31:25Z    INFO    init-setup  Command submitted [blkdiscard /workdir/block-volumes/nvme-ssd] for volume={podName:aerospike-identity-1-0 volumeMode:Block volumeName:nvme-ssd effectiveWipeMethod:dd effectiveInitMethod:blkdiscard aerospikeVolumePath:/dev/xvdf}
2024-07-22T14:31:28Z    INFO    init-setup  Execution completed {"cmd": ["blkdiscard", "/workdir/block-volumes/nvme-ssd"]}

But right after that, the pod keeps crashing:

NAME                     READY   STATUS    RESTARTS      AGE
aerospike-identity-1-0   1/2     Running   3 (46s ago)   110s

Jul 22 2024 14:33:25 GMT: CRITICAL (drv_ssd): (drv_ssd.c:2354) /dev/xvdf: not an Aerospike device but not erased - check config or erase device

I'm using i4i.xlarge, and it supports TRIM, as can be seen in this table.

Any suggestions please?

mateusmuller commented 3 months ago

I might be wrong, but shouldn't this init command include the -z --length 8MiB to wipe the header?

sud82 commented 3 months ago

Hi @mateusmuller, let me give some more details about the initialization. If you go through this doc, you will find that Aerospike suggests two ways to initialize the devices.

Now, we have decided not to offer the 1st option in AKO for the following reasons.

When the user performs initialization manually, they can make a conscious decision between the two options based on the device. With AKO, however, there is a chance that a user sets the 1st initialization method globally and then uses it forever, including for initializing older Aerospike devices. That poses a danger to data integrity: there is no foolproof way for AKO to determine whether the device being initialized is a brand-new device or an older Aerospike device. We therefore had to make a hard choice between user safety and user convenience. Providing the feature is not a challenge on our side; the main challenge is ensuring the feature is safe.

Regarding blkdiscard support: there are different types of TRIM commands, and AKO only supports the TRIM command of this type.
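As an aside, whether a device accepts discard (TRIM) requests at all can be checked from the node via sysfs. A minimal sketch, assuming Linux; `nvme0n1` is only an example device name:

```shell
#!/bin/sh
# Sketch: report whether the kernel exposes discard (TRIM) support for
# a block device. discard_max_bytes > 0 means discard requests are
# accepted; it says nothing about whether trimmed blocks read back as
# zeroes, which is why a plain blkdiscard can still leave a stale
# Aerospike header behind.
check_trim() {
    sys="/sys/block/$1/queue/discard_max_bytes"
    if [ -r "$sys" ] && [ "$(cat "$sys")" -gt 0 ]; then
        echo "$1: discard supported (max $(cat "$sys") bytes per request)"
    else
        echo "$1: discard not supported or not a block device here"
    fi
}
check_trim nvme0n1   # typical instance-store device name on i4i
```

`lsblk --discard` gives the same information (nonzero DISC-GRAN/DISC-MAX columns) in one command.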

mateusmuller commented 3 months ago

Thanks for the explanation @sud82! I'll handle the header cleaning on my own then, cheers!