Closed: mrakopes closed this issue 2 years ago.
I guess this is the same issue as #124
I'd like to have all details you are willing to give me :D
linstor -m --output-version v1 r l pvc-d315b511-6cf6-4c0d-a9d4-851994252a46
Feel free to send me the information via email
Hi, I just faced the same issue:
# linstor r l -r one-vm-8930-disk-5
+-----------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=================================================================|
| one-vm-8930-disk-5 | m14c14 | 55547 | InUse | Ok | UpToDate |
| one-vm-8930-disk-5 | m14c16 | 55547 | Unused | Ok | UpToDate |
+-----------------------------------------------------------------+
# linstor r c m8c9 one-vm-8930-disk-5 -s thindata
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
New resource 'one-vm-8930-disk-5' on node 'm8c9' registered.
Details:
Resource 'one-vm-8930-disk-5' on node 'm8c9' UUID is: f85b1729-4184-490f-8eb8-af2454746940
SUCCESS:
Description:
Volume with number '0' on resource 'one-vm-8930-disk-5' on node 'm8c9' successfully registered
Details:
Volume UUID is: 5f8f78d6-6177-4604-baca-2b93ae59fd60
ERROR:
Description:
(Node: 'm14c14') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
Verification of resource file failed
Details:
The error reported by the runtime environment or operating system is:
The external command 'drbdadm' exited with error code 10
Show reports:
linstor error-reports show 5EC7019E-F54E6-000002
ERROR:
Description:
(Node: 'm14c16') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
Verification of resource file failed
Details:
The error reported by the runtime environment or operating system is:
The external command 'drbdadm' exited with error code 10
Show reports:
linstor error-reports show 5EC7019E-A097F-000004
ERROR:
Description:
(Node: 'm8c9') Generated resource file for resource 'one-vm-8930-disk-5' is invalid.
Cause:
Verification of resource file failed
Details:
The error reported by the runtime environment or operating system is:
The external command 'drbdadm' exited with error code 10
Show reports:
linstor error-reports show 5EC70151-CFA0B-000003
command terminated with exit code 10
5EC7019E-F54E6-000004.log 5EC7019E-F54E6-000003.log 5EC7019E-F54E6-000002.log
# linstor r l -r one-vm-8930-disk-5
+-----------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=================================================================|
| one-vm-8930-disk-5 | m14c14 | 55547 | InUse | Ok | UpToDate |
| one-vm-8930-disk-5 | m14c16 | 55547 | Unused | Ok | UpToDate |
| one-vm-8930-disk-5 | m8c9 | 55547 | | | Unknown |
+-----------------------------------------------------------------+
# linstor -m --output-version v1 r l -r one-vm-8930-disk-5
[
[
{
"layer_object": {
"drbd": {
"node_id": 0,
"al_size": 32,
"al_stripes": 1,
"drbd_resource_definition": {
"transport_type": "IP",
"al_stripe_size_kib": 32,
"al_stripes": 1,
"peer_slots": 7,
"port": 55547,
"secret": "redacted",
"down": false
},
"connections": {
"m14c16": {
"message": "Connected",
"connected": true
}
},
"drbd_volumes": [
{
"backing_device": "/dev/data/one-vm-8930-disk-5_00000",
"allocated_size_kib": 209760048,
"device_path": "/dev/drbd2128",
"usable_size_kib": 209715200,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
}
],
"peer_slots": 7
},
"type": "DRBD",
"children": [
{
"storage": {
"storage_volumes": [
{
"disk_state": "[]",
"allocated_size_kib": 209760256,
"volume_number": 0,
"device_path": "/dev/data/one-vm-8930-disk-5_00000",
"usable_size_kib": 209760256
}
]
},
"type": "STORAGE"
}
]
},
"name": "one-vm-8930-disk-5",
"state": {
"in_use": true
},
"volumes": [
{
"storage_pool_name": "thindata",
"provider_kind": "LVM_THIN",
"state": {
"disk_state": "UpToDate"
},
"layer_data_list": [
{
"data": {
"backing_device": "/dev/data/one-vm-8930-disk-5_00000",
"allocated_size_kib": 209760048,
"device_path": "/dev/drbd2128",
"usable_size_kib": 209715200,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
},
"type": "DRBD"
},
{
"data": {
"disk_state": "[]",
"allocated_size_kib": 209760256,
"volume_number": 0,
"device_path": "/dev/data/one-vm-8930-disk-5_00000",
"usable_size_kib": 209760256
},
"type": "STORAGE"
}
],
"uuid": "9c37c85b-7c23-468b-bb53-afe367b1acbe",
"volume_number": 0,
"device_path": "/dev/drbd2128",
"allocated_size_kib": 154572336
}
],
"node_name": "m14c14",
"props": {
"StorPoolName": "thindata"
},
"uuid": "fda93f55-cf37-4f90-ba84-50b7bad4de93"
},
{
"layer_object": {
"drbd": {
"node_id": 1,
"al_size": 32,
"al_stripes": 1,
"drbd_resource_definition": {
"transport_type": "IP",
"al_stripe_size_kib": 32,
"al_stripes": 1,
"peer_slots": 7,
"port": 55547,
"secret": "redacted",
"down": false
},
"connections": {
"m14c14": {
"message": "Connected",
"connected": true
}
},
"drbd_volumes": [
{
"backing_device": "/dev/data/one-vm-8930-disk-5_00000",
"allocated_size_kib": 209760048,
"device_path": "/dev/drbd2128",
"usable_size_kib": 209715200,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
}
],
"peer_slots": 7
},
"type": "DRBD",
"children": [
{
"storage": {
"storage_volumes": [
{
"disk_state": "[]",
"allocated_size_kib": 209760256,
"volume_number": 0,
"device_path": "/dev/data/one-vm-8930-disk-5_00000",
"usable_size_kib": 209760256
}
]
},
"type": "STORAGE"
}
]
},
"name": "one-vm-8930-disk-5",
"state": {
"in_use": false
},
"volumes": [
{
"storage_pool_name": "thindata",
"provider_kind": "LVM_THIN",
"state": {
"disk_state": "UpToDate"
},
"layer_data_list": [
{
"data": {
"backing_device": "/dev/data/one-vm-8930-disk-5_00000",
"allocated_size_kib": 209760048,
"device_path": "/dev/drbd2128",
"usable_size_kib": 209715200,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
},
"type": "DRBD"
},
{
"data": {
"disk_state": "[]",
"allocated_size_kib": 209760256,
"volume_number": 0,
"device_path": "/dev/data/one-vm-8930-disk-5_00000",
"usable_size_kib": 209760256
},
"type": "STORAGE"
}
],
"uuid": "d6f14e3c-b816-4355-ac4b-9ac00ebbc177",
"volume_number": 0,
"device_path": "/dev/drbd2128",
"allocated_size_kib": 154572336
}
],
"node_name": "m14c16",
"props": {
"StorPoolName": "thindata",
"AutoSelectedStorPoolName": "thindata"
},
"uuid": "3bcf51f6-367f-4711-b87c-5b7c4a12326c"
},
{
"layer_object": {
"drbd": {
"node_id": 0,
"al_size": 32,
"al_stripes": 1,
"drbd_resource_definition": {
"transport_type": "IP",
"al_stripe_size_kib": 32,
"al_stripes": 1,
"peer_slots": 7,
"port": 55547,
"secret": "redacted",
"down": false
},
"drbd_volumes": [
{
"allocated_size_kib": -1,
"usable_size_kib": -1,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
}
],
"peer_slots": 7
},
"type": "DRBD",
"children": [
{
"storage": {
"storage_volumes": [
{
"disk_state": "[]",
"allocated_size_kib": -1,
"volume_number": 0,
"usable_size_kib": -1
}
]
},
"type": "STORAGE"
}
]
},
"name": "one-vm-8930-disk-5",
"uuid": "f85b1729-4184-490f-8eb8-af2454746940",
"volumes": [
{
"storage_pool_name": "thindata",
"provider_kind": "LVM_THIN",
"uuid": "5f8f78d6-6177-4604-baca-2b93ae59fd60",
"volume_number": 0,
"allocated_size_kib": 20976,
"layer_data_list": [
{
"data": {
"allocated_size_kib": -1,
"usable_size_kib": -1,
"drbd_volume_definition": {
"volume_number": 0,
"minor_number": 2128
}
},
"type": "DRBD"
},
{
"data": {
"disk_state": "[]",
"allocated_size_kib": -1,
"volume_number": 0,
"usable_size_kib": -1
},
"type": "STORAGE"
}
]
}
],
"node_name": "m8c9",
"props": {
"StorPoolName": "thindata"
}
}
]
]
root@m8c9:~# cat /var/lib/linstor.d/one-vm-8930-disk-5.res_tmp
# This file was generated by linstor(1.7.1), do not edit manually.
resource "one-vm-8930-disk-5"
{
template-file "linstor_common.conf";
options
{
on-no-quorum io-error;
quorum majority;
}
net
{
cram-hmac-alg sha1;
shared-secret "redacted";
}
on m8c9
{
volume 0
{
disk /dev/data/one-vm-8930-disk-5_00000;
disk
{
discard-zeroes-if-aligned yes;
rs-discard-granularity 65536;
}
meta-disk internal;
device minor 2128;
}
node-id 0;
}
on m14c14
{
volume 0
{
disk /dev/drbd/this/is/not/used;
disk
{
discard-zeroes-if-aligned yes;
rs-discard-granularity 65536;
}
meta-disk internal;
device minor 2128;
}
node-id 0;
}
on m14c16
{
volume 0
{
disk /dev/drbd/this/is/not/used;
disk
{
discard-zeroes-if-aligned yes;
rs-discard-granularity 65536;
}
meta-disk internal;
device minor 2128;
}
node-id 1;
}
connection
{
host m8c9 address ipv4 10.37.129.99:55547;
host m14c14 address ipv4 10.37.130.149:55547;
}
connection
{
host m8c9 address ipv4 10.37.129.99:55547;
host m14c16 address ipv4 10.37.130.151:55547;
}
}
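Note the actual problem in the generated file above: both m8c9 and m14c14 are assigned node-id 0, which is presumably why drbdadm rejects the file with exit code 10. As a quick sanity check (just a sketch, using the file from this report), the conflict can be spotted by grepping the generated configuration for node-ids:
# list the node-ids LINSTOR put into the generated resource file
grep 'node-id' /var/lib/linstor.d/one-vm-8930-disk-5.res_tmp
For the file above this prints "node-id 0;" twice (m8c9 and m14c14) and "node-id 1;" once (m14c16), i.e. a duplicate node-id.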
The problem was gone after restarting the controller.
m8c9 got node-id 2 after the controller was restarted?
yep, node-id 2;
Also, can one of you show me:
linstor c lp
linstor rg lp $resource_group_name
linstor rd lp $resource_definition_name
If you are not using any "special" resource group, then please show me the properties of DfltRscGrp.
Sure:
# linstor c lp
+------------------------------------------------------------------+
| Key | Value |
|==================================================================|
| DrbdOptions/Net/after-sb-0pri | disconnect |
| DrbdOptions/Net/after-sb-1pri | disconnect |
| DrbdOptions/Net/after-sb-2pri | disconnect |
| DrbdOptions/Net/csums-alg | crc32 |
| DrbdOptions/Net/max-buffers | 36864 |
| DrbdOptions/Net/protocol | C |
| DrbdOptions/Net/rcvbuf-size | 2097152 |
| DrbdOptions/Net/sndbuf-size | 1048576 |
| DrbdOptions/Net/verify-alg | crc32 |
| DrbdOptions/PeerDevice/c-fill-target | 10240 |
| DrbdOptions/PeerDevice/c-max-rate | 737280 |
| DrbdOptions/PeerDevice/c-min-rate | 20480 |
| DrbdOptions/PeerDevice/c-plan-ahead | 10 |
| DrbdOptions/auto-add-quorum-tiebreaker | false |
| DrbdOptions/auto-quorum | io-error |
| TcpPortAutoRange | 55000-62000 |
| defaultDebugSslConnector | DebugSslConnector |
| defaultPlainConSvc | PlainConnector |
| defaultSslConSvc | SslConnector |
| netcom/DebugSslConnector/bindaddress | ::0 |
| netcom/DebugSslConnector/enabled | true |
| netcom/DebugSslConnector/keyPasswd | linstor |
| netcom/DebugSslConnector/keyStore | ssl/keystore.jks |
| netcom/DebugSslConnector/keyStorePasswd | linstor |
| netcom/DebugSslConnector/port | 3373 |
| netcom/DebugSslConnector/sslProtocol | TLSv1.2 |
| netcom/DebugSslConnector/trustStore | ssl/certificates.jks |
| netcom/DebugSslConnector/trustStorePasswd | linstor |
| netcom/DebugSslConnector/type | ssl |
| netcom/PlainConnector/bindaddress | 127.0.0.1 |
| netcom/PlainConnector/enabled | true |
| netcom/PlainConnector/port | 3376 |
| netcom/PlainConnector/type | plain |
| netcom/SslConnector/bindaddress | ::0 |
| netcom/SslConnector/enabled | true |
| netcom/SslConnector/keyPasswd | linstor |
| netcom/SslConnector/keyStore | ssl/keystore.jks |
| netcom/SslConnector/keyStorePasswd | linstor |
| netcom/SslConnector/port | 3377 |
| netcom/SslConnector/sslProtocol | TLSv1.2 |
| netcom/SslConnector/trustStore | ssl/certificates.jks |
| netcom/SslConnector/trustStorePasswd | linstor |
| netcom/SslConnector/type | ssl |
+------------------------------------------------------------------+
# linstor rd l -r one-vm-8930-disk-5
+----------------------------------------------------+
| ResourceName | Port | ResourceGroup | State |
|====================================================|
| one-vm-8930-disk-5 | 55547 | DfltRscGrp | ok |
+----------------------------------------------------+
# linstor rg l -r DfltRscGrp
+------------------------------------------------------+
| ResourceGroup | SelectFilter | VlmNrs | Description |
|======================================================|
| DfltRscGrp | PlaceCount: 2 | | |
+------------------------------------------------------+
# linstor rg lp DfltRscGrp
+-------------+
| Key | Value |
|=============|
+-------------+
# linstor rd lp one-vm-8930-disk-5
+----------------------------------------------+
| Key | Value |
|==============================================|
| Aux/one/DISK_ID | 5 |
| Aux/one/DS_ID | 200 |
| Aux/one/VM_ID | 8930 |
| DrbdOptions/Resource/on-no-quorum | io-error |
| DrbdOptions/Resource/quorum | majority |
| DrbdPrimarySetOn | M14C14 |
+----------------------------------------------+
I'm sorry to double-check this, but this is really giving me a headache right now... are you perfectly sure that you did not recreate the resource before / after restarting the controller? So basically:
linstor r c ... -> boom
linstor r l -> broken resource
# restart controller
linstor r l -> working resource
No, I'm pretty sure the controller was working fine; I repeated this a few times:
linstor r c m8c9 one-vm-8930-disk-5 -s thindata # wrong node_id
linstor r d m8c9 one-vm-8930-disk-5
linstor r c m8c9 one-vm-8930-disk-5 -s thindata # wrong node_id
linstor r d m8c9 one-vm-8930-disk-5
# restart controller
linstor r c m8c9 one-vm-8930-disk-5 -s thindata # normal node_id
and after you restart the controller, does this wrong node_id behaviour restart when you start deleting and re-creating that resource? or does it stay "stable" once the controller is restarted?
Seems it is stable after restart.
I was able to create pvc-d315b511-6cf6-4c0d-a9d4-851994252a46 from the original post by @mrakopes as well.
I'm not sure if this is related, but we're also having a couple of issues when a thin-provisioned storage pool runs out of space; today exactly this situation occurred on some node (another one). The situation gets worse if you try to make a snapshot of a resource located on the overfilled node (https://github.com/LINBIT/linstor-server/issues/138): the snapshot switches to a failed state and LINSTOR does not resume the I/O, so libvirt and other software depending on this volume gets stuck forever. I hope I will find enough time to collect more information and prepare a proper bug report for you.
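As a side note on the stuck I/O part (a hedged sketch, not something verified in this thread): when DRBD keeps I/O suspended after a failed snapshot and LINSTOR does not resume it, the suspension can usually be inspected and lifted manually with the standard DRBD tools. The resource name below is a placeholder:
# check whether I/O is suspended for the resource (look at the "suspended:" field)
drbdsetup status <resource> --verbose
# manually resume suspended I/O for the resource on the affected node
drbdadm resume-io <resource>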
Now that you mentioned that you had troubles with overprovisioning... is it possible that this resource-definition already had 3 replicas, you had to delete one due to overprovisioning and when you tried to recreate it on a different third node, that ended up in this duplicated node-id issue?
No, this resource was not on an overfilled node, but there was another overfilled node in the cluster, unrelated to this resource.
My best bet right now is that the resource-definition got into a weird state (internal stuff, I'd like to spare you the details :) ). This weird state would explain the deterministically wrong node-id, as well as why restarting the controller fixes this issue when the resource is recreated. This is just an assumption and also not enough to fix this issue.
So - anything helps. What did you do with this resource-definition before this node-id issue started? Anything you remember (or have logs for) could help...
Any information might help - even if that event was days or weeks ago. Limit should only be the last restart of the controller. The already deployed resource might have been just fine for days although this internal state of the resource-definition was already broken...
Mar 28 09:01 (UTC):
linstor rd c one-vm-8930-disk-5
linstor vd c one-vm-8930-disk-5 200g
linstor rd sp one-vm-8930-disk-5 --aux Aux/one/DISK_ID 5
linstor rd sp one-vm-8930-disk-5 --aux Aux/one/DS_ID 200
linstor rd sp one-vm-8930-disk-5 --aux Aux/one/VM_ID 8930
Then the resource was created on the m14c14 node first:
linstor r c m14c14 one-vm-8930-disk-5 -s thindata
linstor r c --auto-place 2 --replicas-on-same opennebula-1 -s thindata
I don't see any other actions in my log before the incident happened; probably the customer just turned his VM on and off a few times and installed some operating system on it.
There were two unrelated problems which might somehow have affected the controller:
The first issue was connected with the lack of space on one of the nodes; all resources on it switched to diskless mode. (This is another node, but it might be somehow connected.)
The second thing is that lately we have actively started using snapshots; each backup cycle we make snapshots like:
linstor s c m14c14 one-vm-8930-disk-5 backup
where m14c14 is a random diskful node selected from the nodes of the resource, and then we create a new resource from this snapshot:
linstor s vd rst --fr one-vm-8930-disk-5 --fs backup --tr backup-one-vm-8930-disk-5
linstor s r rst --fr one-vm-8930-disk-5 --fs backup --tr backup-one-vm-8930-disk-5
Afterwards the snapshot is removed:
linstor s d one-vm-8930-disk-5 backup
Then we perform a backup of the backup-one-vm-8930-disk-5 resource on m14c14, and then remove the resource definition:
linstor rd d backup-one-vm-8930-disk-5
We're using these steps to back up all of our resources whose names start with one-vm-* (a rough sketch of the whole cycle is included below).
From time to time we end up with a stuck libvirt after that; I'm currently investigating this problem.
UPD: I filed a bug report to the drbd-user mailing list: https://lists.linbit.com/pipermail/drbd-user/2020-May/025623.html
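A rough sketch of the backup cycle described above as a shell loop (an assumed reconstruction, not the exact script in use; the node selection and the actual backup step are placeholders, and the awk-based name extraction is only illustrative):
for res in $(linstor rd l | awk '/one-vm-/ {print $2}'); do       # iterate over one-vm-* resource definitions
    node=m14c14                                                   # placeholder: pick a diskful node for $res
    linstor s c "$node" "$res" backup                             # create the snapshot
    linstor s vd rst --fr "$res" --fs backup --tr "backup-$res"   # restore volume definition from the snapshot
    linstor s r rst --fr "$res" --fs backup --tr "backup-$res"    # restore the resource from the snapshot
    linstor s d "$res" backup                                     # delete the snapshot
    # ... back up the backup-$res resource on $node here ...
    linstor rd d "backup-$res"                                    # remove the temporary resource definition
done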
Today we faced a similar issue connected with node_ids:
The resources were flapping between Unconnected and Connecting states:
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m16c31 ┊ 8937 ┊ Unused ┊ Unconnected(m7c10),Connecting(m6c9) ┊ Diskless ┊ ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m6c9 ┊ 8937 ┊ Unused ┊ Connecting(m16c31) ┊ UpToDate ┊ ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m7c10 ┊ 8937 ┊ Unused ┊ Unconnected(m16c31) ┊ UpToDate ┊ ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
drbdadm status:
pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 role:Secondary
disk:Diskless quorum:no
m6c9 connection:Unconnected
m7c10 connection:NetworkFailure
dmesg logs:
[23358231.316843] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Peer expects me to have a node_id of 0 instead of 2
[23358231.316856] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Connecting -> NetworkFailure )
[23358231.355032] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Aborting remote state change 0 commit not possible
[23358231.355050] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Restarting sender thread
[23358231.355065] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Connection closed
[23358231.355074] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( NetworkFailure -> Unconnected )
[23358231.794960] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Unconnected -> Connecting )
[23358232.338901] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Peer expects me to have a node_id of 0 instead of 2
[23358232.338906] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Connecting -> NetworkFailure )
[23358232.370918] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Unconnected -> Connecting )
[23358232.387021] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Aborting remote state change 0 commit not possible
[23358232.387038] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Restarting sender thread
[23358232.387053] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: Connection closed
[23358232.387063] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( NetworkFailure -> Unconnected )
[23358232.886961] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Peer expects me to have a node_id of 0 instead of 2
[23358232.886983] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( Connecting -> NetworkFailure )
[23358232.955002] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Aborting remote state change 0 commit not possible
[23358232.955021] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Restarting sender thread
[23358232.955079] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: Connection closed
[23358232.955091] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m6c9: conn( NetworkFailure -> Unconnected )
[23358233.394939] drbd pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 m7c10: conn( Unconnected -> Connecting )
Deleting the resource on m16c31 got stuck in the DELETING state:
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-5c235869-cfe2-45ca-88ec-7b8df374147b ┊ m16c31 ┊ 55060 ┊ ┊ Ok ┊ DELETING ┊ ┊
┊ pvc-6bab50ce-b1ab-47e1-bce9-470e3f07bc26 ┊ m16c31 ┊ 55021 ┊ InUse ┊ Ok ┊ Diskless ┊ 2020-10-10 14:27:04 ┊
┊ pvc-a3024553-eeea-41d7-b91e-3ae47417bf73 ┊ m16c31 ┊ 9019 ┊ InUse ┊ Ok ┊ Diskless ┊ 2020-10-10 12:08:58 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m16c31 ┊ 8937 ┊ ┊ Connecting(m7c10,m6c9) ┊ DELETING ┊ ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m6c9 ┊ 8937 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ pvc-6c6c4c57-42e7-40de-af30-3359b1e53032 ┊ m7c10 ┊ 8937 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The linstor-controller restart solved this problem.
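For completeness, the workaround used throughout this thread is simply restarting the LINSTOR controller. A hedged sketch of what that looks like (on a plain systemd installation the usual unit name is linstor-controller; the Kubernetes namespace and deployment name are placeholders for containerized setups such as the one visible in the kubectl output further down):
# systemd-based installation
systemctl restart linstor-controller
# containerized (Kubernetes) installation; names are placeholders
kubectl -n <namespace> rollout restart deployment/<controller-deployment>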
Unfortunately I saved the res file only from the m10c31 node:
@ghernadi I was able to reproduce it on a clean installation:
# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor v l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m19c2 ┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ thindata ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 117.50 MiB ┊ Unused ┊ UpToDate ┊
┊ m19c2 ┊ test-res ┊ thindata ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 2.05 MiB ┊ Unused ┊ UpToDate ┊
┊ m20c2 ┊ test-res ┊ thindata ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 2.05 MiB ┊ Unused ┊ UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
#####
##### mounting /dev/drbd1000 on m19c2 and start writing
#####
# linstor v l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m19c2 ┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ thindata ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 117.50 MiB ┊ Unused ┊ UpToDate ┊
┊ m19c2 ┊ test-res ┊ thindata ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 237.66 MiB ┊ InUse ┊ UpToDate ┊
┊ m20c2 ┊ test-res ┊ thindata ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 237.66 MiB ┊ Unused ┊ UpToDate ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor n l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞═══════════════════════════════════════════════════════╡
┊ m19c2 ┊ SATELLITE ┊ 10.36.131.106:3367 (SSL) ┊ Online ┊
┊ m19c3 ┊ SATELLITE ┊ 10.36.131.107:3367 (SSL) ┊ Online ┊
┊ m20c2 ┊ SATELLITE ┊ 10.36.131.151:3367 (SSL) ┊ Online ┊
┊ m20c3 ┊ SATELLITE ┊ 10.36.131.152:3367 (SSL) ┊ Online ┊
╰───────────────────────────────────────────────────────╯
# linstor n i l m19c2
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭──────────────────────────────────────────────────────────────────╮
┊ m19c2 ┊ NetInterface ┊ IP ┊ Port ┊ EncryptionType ┊
╞══════════════════════════════════════════════════════════════════╡
┊ + ┊ data ┊ 10.37.131.106 ┊ ┊ ┊
┊ + StltCon ┊ default ┊ 10.36.131.106 ┊ 3367 ┊ SSL ┊
╰──────────────────────────────────────────────────────────────────╯
#####
##### Added IP from 10.39.0.0/16 network on m19c2 and m20c2
#####
# linstor n i m m19c2 data
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
NetInterface 'data' on node 'm19c2' modified.
Details:
NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074
# linstor n i m m19c2 data --ip 10.39.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
NetInterface 'data' on node 'm19c2' modified.
Details:
NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074
# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Connecting(m19c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor n i m m19c2 data --ip 10.37.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
NetInterface 'data' on node 'm19c2' modified.
Details:
NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074
# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor n i m m19c2 data --ip 10.39.131.106
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
SUCCESS:
Description:
NetInterface 'data' on node 'm19c2' modified.
Details:
NetInterface 'data' on node 'm19c2' UUID is: d9274de1-6303-4d0b-9dfb-e6b8b419f074
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ InUse ┊ Connecting(m19c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
#####
##### Added IP from 10.39.0.0/16 network on m19c3 and m20c3
#####
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ Unused ┊ Connecting(m20c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ ┊ Ok ┊ DELETING ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ SyncTarget(11.12%) ┊ 2021-07-16 08:34:27 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ Unused ┊ Connecting(m20c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m19c3 ┊ 7000 ┊ ┊ Ok ┊ DELETING ┊ 2021-06-25 14:14:14 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c3 ┊ 7000 ┊ Unused ┊ Connecting(m19c2) ┊ SyncTarget(21.83%) ┊ 2021-07-16 08:34:27 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r d m20c3 test-res
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
INFO:
The given resource will not be deleted but will be taken over as a linstor managed tiebreaker resource.
SUCCESS:
Removal of disk from resource 'test-res' on node 'm20c3' registered
SUCCESS:
Removed disk on 'm20c3'
SUCCESS:
Notified 'm19c2' that disk has been removed on 'm20c3'
SUCCESS:
Notified 'm20c2' that disk has been removed on 'm20c3'
# linstor r l -a
Defaulted container "linstor-controller" out of: linstor-controller, load-certs (init)
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-64aa83b1-9c7e-44fa-b027-74c72f8d4237 ┊ m19c2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-06-25 15:18:12 ┊
┊ test-res ┊ m19c2 ┊ 7000 ┊ Unused ┊ Unconnected(m20c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c2 ┊ 7000 ┊ Unused ┊ Unconnected(m20c3) ┊ UpToDate ┊ 2021-06-25 14:14:15 ┊
┊ test-res ┊ m20c3 ┊ 7000 ┊ Unused ┊ NetworkFailure(m19c2,m20c2) ┊ TieBreaker ┊ 2021-07-16 08:34:27 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@m20c3:~# drbdadm status
test-res role:Secondary
disk:Diskless quorum:no
m19c2 connection:Unconnected
m20c2 connection:Unconnected
root@m19c2:~# drbdsetup status test-res --verbose
test-res node-id:0 role:Secondary suspended:no
volume:0 minor:1000 disk:UpToDate backing_dev:/dev/data/test-res_00000 quorum:yes blocked:no
m20c2 node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
m20c3 node-id:2 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
root@m20c2:~# drbdsetup status test-res --verbose
test-res node-id:1 role:Secondary suspended:no
volume:0 minor:1000 disk:UpToDate backing_dev:/dev/data/test-res_00000 quorum:yes blocked:no
m19c2 node-id:0 connection:Connected role:Secondary congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
m20c3 node-id:2 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
root@m20c3:~# drbdsetup status test-res --verbose
test-res node-id:3 role:Secondary suspended:no
volume:0 minor:1000 disk:Diskless client:yes backing_dev:none quorum:no blocked:no
m19c2 node-id:0 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
m20c2 node-id:1 connection:Unconnected role:Unknown congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:no
Additionally, I'm attaching log files and a LINSTOR database dump.
Hopefully this information will help to solve the peer-id issue once and for all.
Thank you for the reproducer! I was able to fix the bug produced by these steps and also added a new test to our CI for this use case. We just released 1.14.0-rc1 today, but unfortunately this fix did not make it into today's rc1 release. However, the bugfix will be included in the next release (whether 1.14.0-rc2 or the actual 1.14.0 release, whichever comes after today's 1.14.0-rc1).
@ghernadi hooray glad to hear that!
Could you please clarify whether this bug was related to changing the network interface configuration on the nodes or not?
Not related as I actually skipped that part in my reproduction :)
The actual bug was introduced with the shared-pool concept. Linstor had to learn that 2 shared resources must share the same node-id (only one of them can be active at the same time, but both must use the same node-id so as not to confuse the other peers). That led to Linstor needing to recreate some internal layer-data when toggling disk, since that recreation can figure out whether a node-id needs to be reused (when shared) or not. However, this bug was triggered by that recreation for non-shared resources whose node-ids were not all used in sequential order. The minimal test I used here was simply creating 2 diskful resources, letting Linstor give them node-ids 0 and 1 by default, while the third resource (regardless of whether it is diskful or diskless) does not get the next node-id 2 but is instead forced to get node-id 3 (anything higher would also have triggered this issue). With this setup, the next toggle-disk on the third node recreated that internal data and changed its node-id to 2.
So the fix was to simply remember the previous node-id during recreation of those internal layer-data (unless overridden by the shared-resource logic). That makes me quite sure that this has nothing to do with the network interface changes.
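To make that trigger a bit more concrete, here is a hedged sketch of the kind of command sequence the explanation implies (node, resource and storage-pool names are placeholders; whether the third replica actually ends up with a non-sequential node-id depends on the cluster's prior history, so this alone is not guaranteed to reproduce the bug):
# two diskful replicas normally get node-ids 0 and 1
linstor rd c test-res
linstor vd c test-res 1G
linstor r c nodeA test-res -s thindata
linstor r c nodeB test-res -s thindata
# a third replica that, due to earlier history, ends up with node-id 3 instead of 2
linstor r c nodeC test-res --diskless
# toggling the disk makes LINSTOR recreate the internal layer data; before the fix
# this could silently switch the replica to node-id 2, producing the kind of
# "Peer expects me to have a node_id of ..." mismatch seen earlier in the dmesg output
linstor r toggle-disk nodeC test-res -s thindata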
@ghernadi, thank you for the detailed explanation!
Hi. Today I ran into a problem when creating a new replica for my volume:
There is some problem with the new config generated on the nodes:
linstor-conflict-err.txt
When checking the config file, there really is a duplicate node-id:
It assigned the already used node-id 0 to a new node. The same conflict is present in all 3 nodes' config files.
The question is why it tries to use the same node-id. I tried to create the same resource on a different node, which resulted in the same error. On the other hand, I have successfully created a resource on the same node for a different volume.
Using linstor-server version v1.7.1.