LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Error retrieving storage pools after updating (1.24.2->1.25.0) in empty LINSTOR cluster #375

Closed AleksZimin closed 8 months ago

AleksZimin commented 1 year ago

Hello,

We recently updated our LINSTOR cluster from version 1.24.2 to 1.25.0. Our cluster had no resources and only had DfltDisklessStorPool storage pools.

After the update to version 1.25.0, we started encountering an error while trying to retrieve the storage pools. The error message is as follows:

# linstor sp l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node                 ┊ Driver   ┊ PoolName ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State   ┊ SharedName                                ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ offline-stand-stor-1 ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-1;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-2 ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-2;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-3 ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-3;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-4 ┊ DISKLESS ┊          ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-4;DfltDisklessStorPool ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-1'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-2'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-3'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-4'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

Meanwhile, all satellites were displayed as online:

# linstor n l
╭──────────────────────────────────────────────────────────────────────────────────────╮
┊ Node                                ┊ NodeType   ┊ Addresses               ┊ State   ┊
╞══════════════════════════════════════════════════════════════════════════════════════╡
┊ linstor-controller-75f9f656c4-ctf5g ┊ CONTROLLER ┊ 10.111.2.254:3367 (SSL) ┊ OFFLINE ┊
┊ offline-stand-stor-1                ┊ SATELLITE  ┊ 172.20.1.2:3367 (SSL)   ┊ Online  ┊
┊ offline-stand-stor-2                ┊ SATELLITE  ┊ 172.20.1.3:3367 (SSL)   ┊ Online  ┊
┊ offline-stand-stor-3                ┊ SATELLITE  ┊ 172.20.1.4:3367 (SSL)   ┊ Online  ┊
┊ offline-stand-stor-4                ┊ SATELLITE  ┊ 172.20.1.5:3367 (SSL)   ┊ Online  ┊
╰──────────────────────────────────────────────────────────────────────────────────────╯
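The mismatch described here (satellites reported as Online while their storage pools stay in Warning) can be checked mechanically. The following is a minimal sketch, assuming the plain-text table layout shown in this report with cells separated by `┊`. The function names and the trimmed sample tables are illustrative only and are not part of the LINSTOR client.

```python
def parse_table(text):
    """Parse a LINSTOR client table (cells separated by '┊') into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("┊"):
            continue  # skip borders, warning text, and shell prompts
        cells = [cell.strip() for cell in line.strip("┊").split("┊")]
        if header is None:
            header = cells  # the first cell row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def online_nodes_with_warning_pools(node_text, pool_text):
    """Return nodes listed as Online that own at least one pool in Warning state."""
    online = {r["Node"] for r in parse_table(node_text) if r["State"] == "Online"}
    flagged = {r["Node"] for r in parse_table(pool_text) if r["State"] == "Warning"}
    return sorted(online & flagged)


# Trimmed sample input modelled on the output in this report (columns shortened).
NODE_LIST = """\
┊ Node                 ┊ NodeType  ┊ State  ┊
┊ offline-stand-stor-1 ┊ SATELLITE ┊ Online ┊
┊ offline-stand-stor-2 ┊ SATELLITE ┊ Online ┊
"""
POOL_LIST = """\
┊ StoragePool          ┊ Node                 ┊ Driver   ┊ PoolName ┊ State   ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-1 ┊ DISKLESS ┊          ┊ Ok      ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-2 ┊ DISKLESS ┊          ┊ Warning ┊
"""

print(online_nodes_with_warning_pools(NODE_LIST, POOL_LIST))
```

In practice the two inputs would be the captured output of `linstor n l` and `linstor sp l`.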

The controller is shown as OFFLINE because we had disabled the Piraeus Operator to prevent extra messages in the logs, and we restarted the controller pod several times during our investigation. With the Piraeus Operator disabled, the controller from the new pod was never registered. Turning the Piraeus Operator back on triggers registration of the controller pod and brings all nodes to an online state; however, it does not resolve the issue with retrieving the storage pools.

As a workaround, we created a storage pool on one of the nodes, after which the connection to that particular node was established successfully. However, the problem persists with the other nodes:

# linstor sp l
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node                 ┊ Driver   ┊ PoolName           ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State   ┊ SharedName                                ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ offline-stand-stor-1 ┊ DISKLESS ┊                    ┊              ┊               ┊ False        ┊ Ok      ┊ offline-stand-stor-1;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-2 ┊ DISKLESS ┊                    ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-2;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-3 ┊ DISKLESS ┊                    ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-3;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ offline-stand-stor-4 ┊ DISKLESS ┊                    ┊              ┊               ┊ False        ┊ Warning ┊ offline-stand-stor-4;DfltDisklessStorPool ┊
┊ thindata             ┊ offline-stand-stor-1 ┊ LVM_THIN ┊ vg-thin-0/thindata ┊   299.85 GiB ┊    299.85 GiB ┊ True         ┊ Ok      ┊ offline-stand-stor-1;thindata             ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-2'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-3'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
WARNING:
Description:
    No active connection to satellite 'offline-stand-stor-4'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

We created an sos report, then recreated all pods in the LINSTOR namespace and generated another sos report. Additionally, we created an sos report after creating a storage pool on one of the nodes. Files will be attached to this issue.

Your assistance with this issue would be highly appreciated. Thank you for your time and support.

sos_2023-11-04_20-24-13.tar.gz sos_2023-11-04_20-43-00_after_delete_all_pods.tar.gz sos_2023-11-04_20-49-14_after_create_sp.tar.gz

tampler commented 1 year ago

Had this issue as well. Fixed by simply removing and recreating nodes. PS: node name must match the node hostname
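The remove-and-recreate workaround can be sketched as a pair of client commands. This is a hedged illustration rather than a verified procedure: it assumes the `linstor node delete` and `linstor node create` subcommands with the `--communication-type SSL` option, and it uses a node name and address taken from the node list earlier in this thread. The helper `recreate_node_commands` is hypothetical and only builds the command lines without running them.

```python
import shlex


def recreate_node_commands(name, ip, ssl=True):
    """Build (but do not run) the commands to delete and re-add a satellite.

    Assumption: the node name passed here matches the node's hostname,
    as noted in the comment above.
    """
    create = ["linstor", "node", "create"]
    if ssl:
        # Assumed flag for SSL satellite connections, as shown in the node list.
        create += ["--communication-type", "SSL"]
    create += [name, ip]
    return [["linstor", "node", "delete", name], create]


# Example values taken from the node list in the original report.
for argv in recreate_node_commands("offline-stand-stor-2", "172.20.1.3"):
    print(shlex.join(argv))
```

In an operator-managed cluster such as the one in the original report, node registration is handled by the Piraeus Operator, so running these commands by hand may not be appropriate there.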

ghernadi commented 1 year ago

I believe we have found and fixed this issue. Until the fix is released, a controller restart should solve the problem.

crioman commented 1 year ago

I have the same problem on a standalone LINSTOR cluster, v1.25. After adding the satellites, everything works fine, but DfltDisklessStorPool goes into the Warning state after the linstor-controller service is restarted or started on another node. The only way to bring it back to the Ok state is to delete and re-add the node. In any case, all diskless resources are created and work fine.