LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/

Can't create new replica - conflicting use of node-id #150

Closed. mrakopes closed this issue 2 years ago.

mrakopes commented 4 years ago

Hi. Today I ran into a problem when creating a new replica for my volume:

# linstor r c  m7c16 pvc-d315b511-6cf6-4c0d-a9d4-851994252a46 -s thindata
SUCCESS:
    Successfully set property key(s): StorPoolName
INFO:
    Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
    Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
Description:
    New resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' on node 'm7c16' registered.
Details:
    Resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' on node 'm7c16' UUID is: 73399c4d-66df-414f-81d4-bcf6c70b7814
SUCCESS:
Description:
    Volume with number '0' on resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' on node 'm7c16' successfully registered
Details:
    Volume UUID is: 009c147e-747d-48b6-bc2c-120fb5a5b3dc
ERROR:
Description:
    (Node: 'm7c27') Generated resource file for resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC70181-3AA8C-000008
ERROR:
Description:
    (Node: 'm6c17') Generated resource file for resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC7013B-71433-000006
ERROR:
Description:
    (Node: 'm7c16') Generated resource file for resource 'pvc-d315b511-6cf6-4c0d-a9d4-851994252a46' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC70186-25CFB-000007

There is some problem with the new config generated on the nodes:

linstor-conflict-err.txt

When checking the config file, there really is a duplicate node-id:

resource "pvc-d315b511-6cf6-4c0d-a9d4-851994252a46"
{
    template-file "linstor_common.conf";

    options
    {
        on-no-quorum io-error;
        quorum majority;
    }

    net
    {
        cram-hmac-alg     sha1;
        shared-secret     "xxx";
    }

    on m7c16
    {
(...)
        node-id    0;
    }

    on m6c17
    {
(...)
        node-id    0;
    }

    on m7c27
    {
(...)
        node-id    1;
    }

It assigned the already used node-id 0 to the new node. The same conflict is present in all 3 nodes' config files.

The question is why it tries to use the same node-id. I tried to create the same resource on a different node, which resulted in the same error. On the other hand, I have successfully created a resource on the same node for a different volume.
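
For anyone hitting this, the clash can be confirmed straight from the generated file; each node-id should appear exactly once, so any count greater than 1 for the same id is exactly this conflict (a sketch, assuming the file is still present under /var/lib/linstor.d, possibly with a .res_tmp suffix when verification failed):

# grep -h 'node-id' /var/lib/linstor.d/pvc-d315b511-6cf6-4c0d-a9d4-851994252a46.res* | sort | uniq -c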

Using linstor-server v1.7.1

ghernadi commented 4 years ago

I guess this is the same issue as #124

I'd like to have all details you are willing to give me :D

Feel free to send me the information via email

kvaps commented 4 years ago

Hi, I just ran into the same issue:

# linstor r l -r one-vm-8930-disk-5
+-----------------------------------------------------------------+
| ResourceName       | Node   | Port  | Usage  | Conns |    State |
|=================================================================|
| one-vm-8930-disk-5 | m14c14 | 55547 | InUse  | Ok    | UpToDate |
| one-vm-8930-disk-5 | m14c16 | 55547 | Unused | Ok    | UpToDate |
+-----------------------------------------------------------------+
# linstor r c m8c9 one-vm-8930-disk-5 -s thindata
SUCCESS:
    Successfully set property key(s): StorPoolName
SUCCESS:
Description:
    New resource 'one-vm-8930-disk-5' on node 'm8c9' registered.
Details:
    Resource 'one-vm-8930-disk-5' on node 'm8c9' UUID is: f85b1729-4184-490f-8eb8-af2454746940
SUCCESS:
Description:
    Volume with number '0' on resource 'one-vm-8930-disk-5' on node 'm8c9' successfully registered
Details:
    Volume UUID is: 5f8f78d6-6177-4604-baca-2b93ae59fd60
ERROR:
Description:
    (Node: 'm14c14') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC7019E-F54E6-000002
ERROR:
Description:
    (Node: 'm14c16') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC7019E-A097F-000004
ERROR:
Description:
    (Node: 'm8c9') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
    Verification of resource file failed
Details:
    The error reported by the runtime environment or operating system is:
    The external command 'drbdadm' exited with error code 10
Show reports:
    linstor error-reports show 5EC70151-CFA0B-000003
command terminated with exit code 10

5EC7019E-F54E6-000004.log 5EC7019E-F54E6-000003.log 5EC7019E-F54E6-000002.log

# linstor r l -r one-vm-8930-disk-5
+-----------------------------------------------------------------+
| ResourceName       | Node   | Port  | Usage  | Conns |    State |
|=================================================================|
| one-vm-8930-disk-5 | m14c14 | 55547 | InUse  | Ok    | UpToDate |
| one-vm-8930-disk-5 | m14c16 | 55547 | Unused | Ok    | UpToDate |
| one-vm-8930-disk-5 | m8c9   | 55547 |        |       |  Unknown |
+-----------------------------------------------------------------+
# linstor -m --output-version v1 r l -r one-vm-8930-disk-5
[
  [
    {
      "layer_object": {
        "drbd": {
          "node_id": 0,
          "al_size": 32,
          "al_stripes": 1,
          "drbd_resource_definition": {
            "transport_type": "IP",
            "al_stripe_size_kib": 32,
            "al_stripes": 1,
            "peer_slots": 7,
            "port": 55547,
            "secret": "redacted",
            "down": false
          },
          "connections": {
            "m14c16": {
              "message": "Connected",
              "connected": true
            }
          },
          "drbd_volumes": [
            {
              "backing_device": "/dev/data/one-vm-8930-disk-5_00000",
              "allocated_size_kib": 209760048,
              "device_path": "/dev/drbd2128",
              "usable_size_kib": 209715200,
              "drbd_volume_definition": {
                "volume_number": 0,
                "minor_number": 2128
              }
            }
          ],
          "peer_slots": 7
        },
        "type": "DRBD",
        "children": [
          {
            "storage": {
              "storage_volumes": [
                {
                  "disk_state": "[]",
                  "allocated_size_kib": 209760256,
                  "volume_number": 0,
                  "device_path": "/dev/data/one-vm-8930-disk-5_00000",
                  "usable_size_kib": 209760256
                }
              ]
            },
            "type": "STORAGE"
          }
        ]
      },
      "name": "one-vm-8930-disk-5",
      "state": {
        "in_use": true
      },
      "volumes": [
        {
          "storage_pool_name": "thindata",
          "provider_kind": "LVM_THIN",
          "state": {
            "disk_state": "UpToDate"
          },
          "layer_data_list": [
            {
              "data": {
                "backing_device": "/dev/data/one-vm-8930-disk-5_00000",
                "allocated_size_kib": 209760048,
                "device_path": "/dev/drbd2128",
                "usable_size_kib": 209715200,
                "drbd_volume_definition": {
                  "volume_number": 0,
                  "minor_number": 2128
                }
              },
              "type": "DRBD"
            },
            {
              "data": {
                "disk_state": "[]",
                "allocated_size_kib": 209760256,
                "volume_number": 0,
                "device_path": "/dev/data/one-vm-8930-disk-5_00000",
                "usable_size_kib": 209760256
              },
              "type": "STORAGE"
            }
          ],
          "uuid": "9c37c85b-7c23-468b-bb53-afe367b1acbe",
          "volume_number": 0,
          "device_path": "/dev/drbd2128",
          "allocated_size_kib": 154572336
        }
      ],
      "node_name": "m14c14",
      "props": {
        "StorPoolName": "thindata"
      },
      "uuid": "fda93f55-cf37-4f90-ba84-50b7bad4de93"
    },
    {
      "layer_object": {
        "drbd": {
          "node_id": 1,
          "al_size": 32,
          "al_stripes": 1,
          "drbd_resource_definition": {
            "transport_type": "IP",
            "al_stripe_size_kib": 32,
            "al_stripes": 1,
            "peer_slots": 7,
            "port": 55547,
            "secret": "redacted",
            "down": false
          },
          "connections": {
            "m14c14": {
              "message": "Connected",
              "connected": true
            }
          },
          "drbd_volumes": [
            {
              "backing_device": "/dev/data/one-vm-8930-disk-5_00000",
              "allocated_size_kib": 209760048,
              "device_path": "/dev/drbd2128",
              "usable_size_kib": 209715200,
              "drbd_volume_definition": {
                "volume_number": 0,
                "minor_number": 2128
              }
            }
          ],
          "peer_slots": 7
        },
        "type": "DRBD",
        "children": [
          {
            "storage": {
              "storage_volumes": [
                {
                  "disk_state": "[]",
                  "allocated_size_kib": 209760256,
                  "volume_number": 0,
                  "device_path": "/dev/data/one-vm-8930-disk-5_00000",
                  "usable_size_kib": 209760256
                }
              ]
            },
            "type": "STORAGE"
          }
        ]
      },
      "name": "one-vm-8930-disk-5",
      "state": {
        "in_use": false
      },
      "volumes": [
        {
          "storage_pool_name": "thindata",
          "provider_kind": "LVM_THIN",
          "state": {
            "disk_state": "UpToDate"
          },
          "layer_data_list": [
            {
              "data": {
                "backing_device": "/dev/data/one-vm-8930-disk-5_00000",
                "allocated_size_kib": 209760048,
                "device_path": "/dev/drbd2128",
                "usable_size_kib": 209715200,
                "drbd_volume_definition": {
                  "volume_number": 0,
                  "minor_number": 2128
                }
              },
              "type": "DRBD"
            },
            {
              "data": {
                "disk_state": "[]",
                "allocated_size_kib": 209760256,
                "volume_number": 0,
                "device_path": "/dev/data/one-vm-8930-disk-5_00000",
                "usable_size_kib": 209760256
              },
              "type": "STORAGE"
            }
          ],
          "uuid": "d6f14e3c-b816-4355-ac4b-9ac00ebbc177",
          "volume_number": 0,
          "device_path": "/dev/drbd2128",
          "allocated_size_kib": 154572336
        }
      ],
      "node_name": "m14c16",
      "props": {
        "StorPoolName": "thindata",
        "AutoSelectedStorPoolName": "thindata"
      },
      "uuid": "3bcf51f6-367f-4711-b87c-5b7c4a12326c"
    },
    {
      "layer_object": {
        "drbd": {
          "node_id": 0,
          "al_size": 32,
          "al_stripes": 1,
          "drbd_resource_definition": {
            "transport_type": "IP",
            "al_stripe_size_kib": 32,
            "al_stripes": 1,
            "peer_slots": 7,
            "port": 55547,
            "secret": "redacted",
            "down": false
          },
          "drbd_volumes": [
            {
              "allocated_size_kib": -1,
              "usable_size_kib": -1,
              "drbd_volume_definition": {
                "volume_number": 0,
                "minor_number": 2128
              }
            }
          ],
          "peer_slots": 7
        },
        "type": "DRBD",
        "children": [
          {
            "storage": {
              "storage_volumes": [
                {
                  "disk_state": "[]",
                  "allocated_size_kib": -1,
                  "volume_number": 0,
                  "usable_size_kib": -1
                }
              ]
            },
            "type": "STORAGE"
          }
        ]
      },
      "name": "one-vm-8930-disk-5",
      "uuid": "f85b1729-4184-490f-8eb8-af2454746940",
      "volumes": [
        {
          "storage_pool_name": "thindata",
          "provider_kind": "LVM_THIN",
          "uuid": "5f8f78d6-6177-4604-baca-2b93ae59fd60",
          "volume_number": 0,
          "allocated_size_kib": 20976,
          "layer_data_list": [
            {
              "data": {
                "allocated_size_kib": -1,
                "usable_size_kib": -1,
                "drbd_volume_definition": {
                  "volume_number": 0,
                  "minor_number": 2128
                }
              },
              "type": "DRBD"
            },
            {
              "data": {
                "disk_state": "[]",
                "allocated_size_kib": -1,
                "volume_number": 0,
                "usable_size_kib": -1
              },
              "type": "STORAGE"
            }
          ]
        }
      ],
      "node_name": "m8c9",
      "props": {
        "StorPoolName": "thindata"
      }
    }
  ]
]
root@m8c9:~# cat /var/lib/linstor.d/one-vm-8930-disk-5.res_tmp 
# This file was generated by linstor(1.7.1), do not edit manually.

resource "one-vm-8930-disk-5"
{
    template-file "linstor_common.conf";

    options
    {
        on-no-quorum io-error;
        quorum majority;
    }

    net
    {
        cram-hmac-alg     sha1;
        shared-secret     "redacted";
    }

    on m8c9
    {
        volume 0
        {
            disk        /dev/data/one-vm-8930-disk-5_00000;
            disk
            {
                discard-zeroes-if-aligned yes;
                rs-discard-granularity 65536;
            }
            meta-disk   internal;
            device      minor 2128;
        }
        node-id    0;
    }

    on m14c14
    {
        volume 0
        {
            disk        /dev/drbd/this/is/not/used;
            disk
            {
                discard-zeroes-if-aligned yes;
                rs-discard-granularity 65536;
            }
            meta-disk   internal;
            device      minor 2128;
        }
        node-id    0;
    }

    on m14c16
    {
        volume 0
        {
            disk        /dev/drbd/this/is/not/used;
            disk
            {
                discard-zeroes-if-aligned yes;
                rs-discard-granularity 65536;
            }
            meta-disk   internal;
            device      minor 2128;
        }
        node-id    1;
    }

    connection
    {
        host m8c9 address ipv4 10.37.129.99:55547;
        host m14c14 address ipv4 10.37.130.149:55547;
    }

    connection
    {
        host m8c9 address ipv4 10.37.129.99:55547;
        host m14c16 address ipv4 10.37.130.151:55547;
    }
}

kvaps commented 4 years ago

The problem was gone after restarting the controller.
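
For reference, on a typical systemd-based installation that amounts to the following (a sketch; in a containerized setup like the one shown later in this thread, restarting the controller container has the same effect):

# systemctl restart linstor-controller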

ghernadi commented 4 years ago

m8c9 got node-id 2 after the controller was restarted?

kvaps commented 4 years ago

m8c9 got node-id 2 after the controller was restarted?

Yep, node-id 2.

ghernadi commented 4 years ago

Also, could one of you show me:

linstor c lp
linstor rg lp $resource_group_name
linstor rd lp $resource_definition_name

If you are not using any "special" resource group, then please show me the properties of DfltRscGrp.

kvaps commented 4 years ago

Sure:

# linstor c lp 
+------------------------------------------------------------------+
| Key                                       | Value                |
|==================================================================|
| DrbdOptions/Net/after-sb-0pri             | disconnect           |
| DrbdOptions/Net/after-sb-1pri             | disconnect           |
| DrbdOptions/Net/after-sb-2pri             | disconnect           |
| DrbdOptions/Net/csums-alg                 | crc32                |
| DrbdOptions/Net/max-buffers               | 36864                |
| DrbdOptions/Net/protocol                  | C                    |
| DrbdOptions/Net/rcvbuf-size               | 2097152              |
| DrbdOptions/Net/sndbuf-size               | 1048576              |
| DrbdOptions/Net/verify-alg                | crc32                |
| DrbdOptions/PeerDevice/c-fill-target      | 10240                |
| DrbdOptions/PeerDevice/c-max-rate         | 737280               |
| DrbdOptions/PeerDevice/c-min-rate         | 20480                |
| DrbdOptions/PeerDevice/c-plan-ahead       | 10                   |
| DrbdOptions/auto-add-quorum-tiebreaker    | false                |
| DrbdOptions/auto-quorum                   | io-error             |
| TcpPortAutoRange                          | 55000-62000          |
| defaultDebugSslConnector                  | DebugSslConnector    |
| defaultPlainConSvc                        | PlainConnector       |
| defaultSslConSvc                          | SslConnector         |
| netcom/DebugSslConnector/bindaddress      | ::0                  |
| netcom/DebugSslConnector/enabled          | true                 |
| netcom/DebugSslConnector/keyPasswd        | linstor              |
| netcom/DebugSslConnector/keyStore         | ssl/keystore.jks     |
| netcom/DebugSslConnector/keyStorePasswd   | linstor              |
| netcom/DebugSslConnector/port             | 3373                 |
| netcom/DebugSslConnector/sslProtocol      | TLSv1.2              |
| netcom/DebugSslConnector/trustStore       | ssl/certificates.jks |
| netcom/DebugSslConnector/trustStorePasswd | linstor              |
| netcom/DebugSslConnector/type             | ssl                  |
| netcom/PlainConnector/bindaddress         | 127.0.0.1            |
| netcom/PlainConnector/enabled             | true                 |
| netcom/PlainConnector/port                | 3376                 |
| netcom/PlainConnector/type                | plain                |
| netcom/SslConnector/bindaddress           | ::0                  |
| netcom/SslConnector/enabled               | true                 |
| netcom/SslConnector/keyPasswd             | linstor              |
| netcom/SslConnector/keyStore              | ssl/keystore.jks     |
| netcom/SslConnector/keyStorePasswd        | linstor              |
| netcom/SslConnector/port                  | 3377                 |
| netcom/SslConnector/sslProtocol           | TLSv1.2              |
| netcom/SslConnector/trustStore            | ssl/certificates.jks |
| netcom/SslConnector/trustStorePasswd      | linstor              |
| netcom/SslConnector/type                  | ssl                  |
+------------------------------------------------------------------+
# linstor rd l -r one-vm-8930-disk-5
+----------------------------------------------------+
| ResourceName       | Port  | ResourceGroup | State |
|====================================================|
| one-vm-8930-disk-5 | 55547 | DfltRscGrp    | ok    |
+----------------------------------------------------+
# linstor rg l -r DfltRscGrp
+------------------------------------------------------+
| ResourceGroup | SelectFilter  | VlmNrs | Description |
|======================================================|
| DfltRscGrp    | PlaceCount: 2 |        |             |
+------------------------------------------------------+
# linstor rg lp DfltRscGrp
+-------------+
| Key | Value |
|=============|
+-------------+
# linstor rd lp one-vm-8930-disk-5
+----------------------------------------------+
| Key                               | Value    |
|==============================================|
| Aux/one/DISK_ID                   | 5        |
| Aux/one/DS_ID                     | 200      |
| Aux/one/VM_ID                     | 8930     |
| DrbdOptions/Resource/on-no-quorum | io-error |
| DrbdOptions/Resource/quorum       | majority |
| DrbdPrimarySetOn                  | M14C14   |
+----------------------------------------------+

ghernadi commented 4 years ago

I'm sorry to double-check this, but this is really giving me a headache right now... are you perfectly sure that you did not recreate the resource before/after restarting the controller? So basically:

linstor r c ... -> boom
linstor r l -> broken resource
# restart controller
linstor r l -> working resource

kvaps commented 4 years ago

No, I'm pretty sure the controller was working fine; I repeated this a few times:

linstor r c m8c9 one-vm-8930-disk-5 -s thindata # wrong node_id
linstor r d m8c9 one-vm-8930-disk-5
linstor r c m8c9 one-vm-8930-disk-5 -s thindata # wrong node_id
linstor r d m8c9 one-vm-8930-disk-5
# restart controller
linstor r c m8c9 one-vm-8930-disk-5 -s thindata # normal node_id

ghernadi commented 4 years ago

And after you restart the controller, does this wrong node_id behaviour return when you start deleting and re-creating that resource? Or does it stay "stable" once the controller is restarted?

kvaps commented 4 years ago

And after you restart the controller, does this wrong node_id behaviour return when you start deleting and re-creating that resource? Or does it stay "stable" once the controller is restarted?

Seems it is stable after restart.

I was able to create pvc-d315b511-6cf6-4c0d-a9d4-851994252a46 from the original post by @mrakopes as well.

I'm not sure if this is related, but we're also having a couple of issues when a thin-provisioned storage pool runs out of space; today exactly this situation occurred on some node (another one). The situation gets worse if you try to make a snapshot of a resource located on an overfilled node (https://github.com/LINBIT/linstor-server/issues/138): the snapshot switches to a failed state and LINSTOR does not resume the I/O, so libvirt and any other software depending on this volume gets stuck forever. I hope I will find enough time to collect more information and prepare a proper bug report for you.

ghernadi commented 4 years ago

Now that you mentioned that you had troubles with overprovisioning... is it possible that this resource-definition already had 3 replicas, that you had to delete one due to overprovisioning, and that when you tried to recreate it on a different third node, it ended up in this duplicated node-id issue?

kvaps commented 4 years ago

No, this resource was not on an overfilled node, but there was another overfilled node in the cluster, unrelated to this resource.

ghernadi commented 4 years ago

My best bet right now is that the resource-definition got into a weird state (internal stuff, I'd like to spare you the details :) ). This weird state would explain the deterministically wrong node-id, as well as why restarting the controller fixes this issue when the resource is recreated. This is just an assumption and also not enough to fix this issue.

So - anything helps. What did you do with this resource-definition before this node-id issue started? Anything you remember (or have logs for) could help...

Any information might help - even if that event was days or weeks ago. The limit should only be the last restart of the controller. The already deployed resource might have been just fine for days even though this internal state of the resource-definition was already broken...

kvaps commented 4 years ago

There were two unrelated problems which might have somehow affected the controller:

The first issue was connected with a lack of space on one of the nodes: all resources on it switched to diskless mode. (This is another node, but it might be somehow connected.)

The second thing is that lately we have actively started using snapshots; each backup cycle we make snapshots like:

We use these steps to back up all our resources starting with one-vm-*. From time to time we get a stuck libvirt after that; currently I'm investigating this problem.

UPD: I filed a bug report to the drbd-user mailing list: https://lists.linbit.com/pipermail/drbd-user/2020-May/025623.html

kvaps commented 4 years ago

Today we faced a similar issue connected with node_ids:

The resources were flapping between Unconnected and Connecting states:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port ┊ Usage  ┊ Conns                               ┊    State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m16c31 ┊ 8937 ┊ Unused ┊ Unconnected(m7c10),Connecting(m6c9) ┊ Diskless ┊           ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m6c9   ┊ 8937 ┊ Unused ┊ Connecting(m16c31)                  ┊ UpToDate ┊           ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m7c10  ┊ 8937 ┊ Unused ┊ Unconnected(m16c31)                 ┊ UpToDate ┊           ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

drbdadm status:

pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 role:Secondary
  disk:Diskless quorum:no
  m6c9 connection:Unconnected
  m7c10 connection:NetworkFailure

dmesg logs:

[23358231.316843] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Peer expects me to have a node_id of 0 instead of 2
[23358231.316856] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Connecting -> NetworkFailure )
[23358231.355032] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Aborting remote state change 0 commit not possible
[23358231.355050] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Restarting sender thread
[23358231.355065] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Connection closed
[23358231.355074] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( NetworkFailure -> Unconnected )
[23358231.794960] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Unconnected -> Connecting )
[23358232.338901] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Peer expects me to have a node_id of 0 instead of 2
[23358232.338906] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Connecting -> NetworkFailure )
[23358232.370918] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Unconnected -> Connecting )
[23358232.387021] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Aborting remote state change 0 commit not possible
[23358232.387038] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Restarting sender thread
[23358232.387053] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Connection closed
[23358232.387063] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( NetworkFailure -> Unconnected )
[23358232.886961] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Peer expects me to have a node_id of 0 instead of 2
[23358232.886983] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Connecting -> NetworkFailure )
[23358232.955002] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Aborting remote state change 0 commit not possible
[23358232.955021] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Restarting sender thread
[23358232.955079] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Connection closed
[23358232.955091] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( NetworkFailure -> Unconnected )
[23358233.394939] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Unconnected -> Connecting )
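
Each peer's view of the configured node-ids can be cross-checked directly with drbdsetup (as also done later in this thread); a sketch: the id in the first output line is the local one, and the ids on the peer lines are what this node expects of its peers:

# drbdsetup status pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 --verbose | grep node-id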

Deleting the resource on m16c31 got stuck in the DELETING state:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port  ┊ Usage ┊ Conns ┊    State ┊ CreatedOn           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-5c235869-cfe2-45ca-88ec-7b8df374147b ┊ m16c31 ┊ 55060 ┊       ┊ Ok    ┊ DELETING ┊                     ┊
┊ pvc-6bab50ce-b1ab-47e1-bce9-470e3f07bc26 ┊ m16c31 ┊ 55021 ┊ InUse ┊ Ok    ┊ Diskless ┊ 2020-10-10 14:27:04 ┊
┊ pvc-a3024553-eeea-41d7-b91e-3ae47417bf73 ┊ m16c31 ┊ 9019  ┊ InUse ┊ Ok    ┊ Diskless ┊ 2020-10-10 12:08:58 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port ┊ Usage  ┊ Conns                  ┊    State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m16c31 ┊ 8937 ┊        ┊ Connecting(m7c10,m6c9) ┊ DELETING ┊           ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m6c9   ┊ 8937 ┊ Unused ┊ Ok                     ┊ UpToDate ┊           ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m7c10  ┊ 8937 ┊ Unused ┊ Ok                     ┊ UpToDate ┊           ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The linstor-controller restart solved this problem.

Unfortunately I saved the res file only from the m16c31 node:

pvc-6c6c4c57-42e7-40de-af30-3359b1e53032.res.txt

kvaps commented 3 years ago

@ghernadi I was able to reproduce it on a clean installation:

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor v l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ Resource                                 ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊  Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m19c2 ┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ thindata    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 117.50 MiB ┊ Unused ┊ UpToDate ┊
┊ m19c2 ┊ test-res                                 ┊ thindata    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊   2.05 MiB ┊ Unused ┊ UpToDate ┊
┊ m20c2 ┊ test-res                                 ┊ thindata    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊   2.05 MiB ┊ Unused ┊ UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

#####
##### mounting /dev/drbd1000 on m19c2 and start writing
#####

# linstor v l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ Resource                                 ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊  Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m19c2 ┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ thindata    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 117.50 MiB ┊ Unused ┊ UpToDate ┊
┊ m19c2 ┊ test-res                                 ┊ thindata    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 237.66 MiB ┊ InUse  ┊ UpToDate ┊
┊ m20c2 ┊ test-res                                 ┊ thindata    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 237.66 MiB ┊ Unused ┊ UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor n l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses                ┊ State  ┊
╞═══════════════════════════════════════════════════════╡
┊ m19c2 ┊ SATELLITE ┊ 10.36.131.106:3367 (SSL) ┊ Online ┊
┊ m19c3 ┊ SATELLITE ┊ 10.36.131.107:3367 (SSL) ┊ Online ┊
┊ m20c2 ┊ SATELLITE ┊ 10.36.131.151:3367 (SSL) ┊ Online ┊
┊ m20c3 ┊ SATELLITE ┊ 10.36.131.152:3367 (SSL) ┊ Online ┊
╰───────────────────────────────────────────────────────╯

# linstor n i l m19c2
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭──────────────────────────────────────────────────────────────────╮
┊ m19c2     ┊ NetInterface ┊ IP            ┊ Port ┊ EncryptionType ┊
╞══════════════════════════════════════════════════════════════════╡
┊ +         ┊ data         ┊ 10.37.131.106 ┊      ┊                ┊
┊ + StltCon ┊ default      ┊ 10.36.131.106 ┊ 3367 ┊ SSL            ┊
╰──────────────────────────────────────────────────────────────────╯

#####
##### Added IP from 10.39.0.0/16 network on m19c2 and m20c2
#####
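#####
##### (the address change itself happens outside LINSTOR, hypothetically
##### something like `ip addr add 10.39.131.106/16 dev eth1` on m19c2;
##### the interface name eth1 is an assumption)
#####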

# linstor n i m m19c2 data
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
    NetInterface 'data' on node 'm19c2' modified.
Details:
    NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074

# linstor n i m m19c2 data --ip 10.39.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
    NetInterface 'data' on node 'm19c2' modified.
Details:
    NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns             ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok                ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Connecting(m19c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok                ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor n i m m19c2 data --ip 10.37.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
    NetInterface 'data' on node 'm19c2' modified.
Details:
    NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor n i m m19c2 data --ip 10.39.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
    NetInterface 'data' on node 'm19c2' modified.
Details:
    NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns             ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok                ┊   UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ InUse  ┊ Connecting(m19c3) ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok                ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

#####
##### Added IP from 10.39.0.0/16 network on m19c3 and m20c3
#####

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns             ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok                ┊           UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ Unused ┊ Connecting(m20c3) ┊           UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊        ┊ Ok                ┊           DELETING ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok                ┊           UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ SyncTarget(11.12%) ┊ 2021-07-16 08:34:27 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns             ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok                ┊           UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ Unused ┊ Connecting(m20c3) ┊           UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m19c3 ┊ 7000 ┊        ┊ Ok                ┊           DELETING ┊ 2021-06-25 14:14:14 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok                ┊           UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ SyncTarget(21.83%) ┊ 2021-07-16 08:34:27 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r d m20c3 test-res
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
INFO:
    The given resource will not be deleted but will be taken over as a linstor managed tiebreaker resource.
SUCCESS:
    Removal of disk from resource 'test-res' on node 'm20c3' registered
SUCCESS:
    Removed disk on 'm20c3'
SUCCESS:
    Notified 'm19c2' that disk has been removed on 'm20c3'
SUCCESS:
    Notified 'm20c2' that disk has been removed on 'm20c3'

# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port ┊ Usage  ┊ Conns                       ┊      State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok                          ┊   UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res                                 ┊ m19c2 ┊ 7000 ┊ Unused ┊ Unconnected(m20c3)          ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c2 ┊ 7000 ┊ Unused ┊ Unconnected(m20c3)          ┊   UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res                                 ┊ m20c3 ┊ 7000 ┊ Unused ┊ NetworkFailure(m19c2,m20c2) ┊ TieBreaker ┊ 2021-07-16 08:34:27 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

root@m20c3:~# drbdadm status
test-res role:Secondary
  disk:Diskless quorum:no
  m19c2 connection:Unconnected
  m20c2 connection:Unconnected
root@m19c2:~# drbdsetup status test-res --verbose
test-res node-id:0 role:Secondary suspended:no
  volume:0 minor:1000 disk:UpToDate backing_dev:/dev/data/test-res_00000 quorum:yes blocked:no
  m20c2 node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
  m20c3 node-id:2 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Off peer-disk:DUnknown resync-suspended:no

root@m20c2:~# drbdsetup status test-res --verbose
test-res node-id:1 role:Secondary suspended:no
  volume:0 minor:1000 disk:UpToDate backing_dev:/dev/data/test-res_00000 quorum:yes blocked:no
  m19c2 node-id:0 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
  m20c3 node-id:2 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Off peer-disk:DUnknown resync-suspended:no

root@m20c3:~# drbdsetup status test-res --verbose
test-res node-id:3 role:Secondary suspended:no
  volume:0 minor:1000 disk:Diskless client:yes backing_dev:none quorum:no blocked:no
  m19c2 node-id:0 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
  m20c2 node-id:1 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Off peer-disk:DUnknown resync-suspended:no

Additionally, I'm attaching log files and a LINSTOR database dump.

Hopefully this information will help to solve the node-id issue once and for all.

ghernadi commented 3 years ago

Thank you for the reproducer! I was able to fix the bug produced by these steps and also added a new test for this use case to our CI. We just released 1.14.0-rc1 today, but unfortunately this fix did not make it into today's rc1 release. However, the bugfix will be included in the next release (whether that is 1.14.0-rc2 or the actual 1.14.0 release, whichever comes after today's 1.14.0-rc1).

kvaps commented 3 years ago

@ghernadi hooray, glad to hear that!

Could you please clarify whether this bug was related to changing the network interface configuration on the nodes or not?

ghernadi commented 3 years ago

Not related, as I actually skipped that part in my reproduction :)

The actual bug was introduced with the shared-pool concept. LINSTOR had to learn that 2 shared resources must share the same node-id (only one of them can be active at a time, but both must use the same node-id so as not to confuse the other peers). As a consequence, LINSTOR now recreates some internal layer-data when toggling a disk, since that recreation can figure out whether a node-id needs to be reused (when shared) or not. However, this recreation also triggered the bug for non-shared resources whose node-ids were not all used in sequential order. The minimal test I used here was simply creating 2 diskful resources, letting LINSTOR give them the default node-ids 0 and 1, and forcing the third resource (regardless of whether diskful or diskless) to node-id 3 instead of the next free node-id 2 (anything higher would also have triggered this issue). With this setup, the next toggle-disk on the third node recreated those internal data and changed its node-id to 2.

So the fix was simply to remember the previous node-id during the recreation of those internal layer-data (unless overridden by the shared-resource logic). That makes me quite sure that this has nothing to do with the network interface changes.
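
The gap condition is easy to check for on a live cluster with the machine-readable output shown earlier in this thread (a sketch, not official tooling); non-sequential node_ids for a resource, e.g. 0, 1, 3, are what triggered the renumbering on affected versions:

# linstor -m --output-version v1 r l -r test-res | grep -o '"node_id": [0-9]*'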

kvaps commented 3 years ago

@ghernadi, thank you for the detailed explanation!