Closed kvaps closed 2 years ago
Can you please also attach error reports 5F842882-00000-015924
as well as 5F842882-00000-015925
I have not often seen an error report like the one you are showing here... quite strange..
However, just wondering - are you aware of these error reports?
18:35:54.935 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - Resource on node m11c11 has insufficient peer slots to add another peer [Report number 5F842882-00000-015917]
18:43:57.915 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - Resource on node m11c11 has insufficient peer slots to add another peer [Report number 5F842882-00000-015918]
If I am not mistaken, the default peer-count is 7... do you have more resources than that in one resource-definition?
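A quick way to check this (a sketch, using the resource name from this thread; each replica occupies one row of the table output):

```shell
# Count how many replicas exist for the resource definition; with the
# default of 7 peer slots, more than 8 resources in total would be a problem.
linstor r l -r one-vm-8871-disk-0 | grep -c 'one-vm-8871-disk-0'
```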
After a restart, linstor-controller started listing the resources again.
# linstor r l -r one-vm-8871-disk-0
╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ StandAlone(m14c42) ┊ UpToDate ┊ 2020-11-10 14:38:08 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c17) ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse ┊ Ok ┊ Diskless ┊ 2020-11-09 06:35:42 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r d m11c17 one-vm-8871-disk-0
SUCCESS:
Description:
Node: m11c17, Resource: one-vm-8871-disk-0 marked for deletion.
Details:
Node: m11c17, Resource: one-vm-8871-disk-0 UUID is: f0a45ef8-f80c-478d-b2d7-da01db587d33
SUCCESS:
Notified 'm15c41' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
Notified 'm14c42' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
Deleted 'one-vm-8871-disk-0' on 'm11c17'
SUCCESS:
Notified 'm13c41' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
Description:
Node: m11c17, Resource: one-vm-8871-disk-0 deletion complete.
Details:
Node: m11c17, Resource: one-vm-8871-disk-0 UUID was: f0a45ef8-f80c-478d-b2d7-da01db587d33
# linstor r c m11c17 one-vm-8871-disk-0 -s thindata
SUCCESS:
Successfully set property key(s): StorPoolName
SUCCESS:
Description:
New resource 'one-vm-8871-disk-0' on node 'm11c17' registered.
Details:
Resource 'one-vm-8871-disk-0' on node 'm11c17' UUID is: 51971b9d-43eb-4c09-8153-9d7390146916
SUCCESS:
Description:
Volume with number '0' on resource 'one-vm-8871-disk-0' on node 'm11c17' successfully registered
Details:
Volume UUID is: 1723b233-db22-426f-8df6-9c8b1f699b52
SUCCESS:
Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm15c41'
SUCCESS:
Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm14c42'
SUCCESS:
Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm13c41'
SUCCESS:
Created resource 'one-vm-8871-disk-0' on 'm11c17'
ERROR:
Resource did not became ready on node 'm11c17' within reasonable time, check Satellite for errors.
# linstor r l -r one-vm-8871-disk-0
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ SyncTarget(2.14%) ┊ 2020-11-10 15:45:22 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c17) ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse ┊ Ok ┊ Diskless ┊ 2020-11-09 06:35:42 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Can you please also attach error reports 5F842882-00000-015924 as well as 5F842882-00000-015925
sure: ErrorReport-5F842882-00000-015917.log ErrorReport-5F842882-00000-015918.log
uhm... okay... Not sure how to say it, as I really don't want to be mean... but the first one leads to 404, and although the second one works, I asked for ..24 and ..25, not ..17 and ..18 :)
Ahh sorry, I had a really nervous day today :)
ErrorReport-5F842882-00000-015924.log ErrorReport-5F842882-00000-015925.log
Thanks :)
So, after a brief research, here are my findings:
The weird NullPointerException
with no stacktrace:
When an Exception is thrown too often, the JVM is apparently allowed to omit reconstructing the stacktrace for better performance (since JDK 5).
That means, I'd like to ask you to check your logs and try to find one of the first ErrorReports with a corresponding NullPointerException. That ErrorReport should contain the full stacktrace that might help us identify the actual issue.
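Something along these lines might help locate such a report (a sketch; the log directory is an assumption, adjust it to wherever your controller stores its ErrorReport files):

```shell
# List the reports mentioning NullPointerException, lowest report number
# first; the earliest one should still contain the full stacktrace.
grep -l 'NullPointerException' /var/log/linstor-controller/ErrorReport-*.log \
  | sort | head -n 1
```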
Your attempt to re-create the resource after restarting the satellite:
# linstor r c m11c17 one-vm-8871-disk-0 -s thindata
...
ERROR:
Resource did not became ready on node 'm11c17' within reasonable time, check Satellite for errors.
... one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ SyncTarget(2.14%) ┊ 2020-11-10 15:45:22
I am not fully sure about what happened here, but as the `r l` output shows, the resource is already syncing. Maybe a networking issue? Some timeout, and Linstor just gave up waiting too early?
* The possibly unrelated ErrorReport 015918, regarding `Resource on node m11c11 has insufficient peer slots to add another peer`
Did you check if you have more than 7 resources? If you did not have more than 7, we might need to dig deeper into this issue...
* ErrorReport ..24 shows: `Dependency not found` - `The storage pool 'DfltStorPool' for resource 'one-vm-8871-disk-0' for volume number '0' is not deployed on node 'm11c17'.`
I guess you forgot to add a `-s $myStoragePool` to the `r c` command?
* ErrorReport ..25: I have seen this bug, and I believe it is fixed with the recently released 1.10.0. The only time I saw this bug was during a toggle-disk.
My very wild guess: there are ~5 seconds between ErrorReports ..24 and ..25. If my previous guess is correct that you just forgot to add `-s $myStoragePool` to your `r c` command, I would highly assume that you realized that mistake and fixed it quite quickly... maybe within a few seconds? Now if that new `r c` was replacing a tiebreaker, Linstor automatically performs a toggle-disk (converting the diskless tiebreaker into the requested diskful resource).
That toggle-disk could have triggered the bug I was talking about earlier.
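For reference, the same conversion can also be requested explicitly; a sketch using the node and pool names from this thread (check `linstor resource toggle-disk -h` for the exact flags of your client version):

```shell
# Convert a diskless resource (e.g. a tiebreaker) on the node into a
# diskful one backed by the given storage pool.
linstor resource toggle-disk m11c17 one-vm-8871-disk-0 --storage-pool thindata
```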
Small update regarding the missing stacktrace (maybe someone finds this interesting) - by accident I also just ran into an endless loop of NullPointerExceptions and saw that the first 6148 ErrorReports did contain the full stacktrace, whereas the next ones showed no stacktraces, just like in the original post of this issue.
I would not assume that this count is the same on every machine; many factors could influence the precise threshold...
Hi @ghernadi,
The weird NullPointerException with no stacktrace: When an Exception is thrown too often, the JVM is apparently allowed to omit reconstructing the stacktrace for better performance since JDK 5. That means, I'd like to ask you to check your logs and try to find one of the first ErrorReports with a corresponding NullPointerException. That ErrorReport should contain the full stacktrace that might help us identify the actual issue.
Well I can provide you all the controller reports collected for this session
I am not fully sure about what happened here, but as the r l shows the resource is already syncing. Maybe a networking issue? Some timeout and Linstor just gave up waiting too early?
The resource was syncing but didn't finish it correctly:
# linstor r l -r one-vm-8871-disk-0
╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c26 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ UpToDate ┊ 2020-11-10 18:43:37 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c26) ┊ UpToDate ┊ ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse ┊ Ok ┊ Diskless ┊ 2020-11-09 06:35:42 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor v l -r one-vm-8871-disk-0
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m11c26 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 236.95 GiB ┊ Unused ┊ UpToDate ┊
┊ m13c41 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m14c42 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m15c41 ┊ one-vm-8871-disk-0 ┊ DfltDisklessStorPool ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ ┊ InUse ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I'm going to investigate this; for now I have collected the logs from the nodes:
The maybe unrelated ErrorReport 015918, regarding Resource on node m11c11 has insufficient peer slots to add another peer Did you check if you have more than 7 resources? If you did not have more than 7, we might need to dig deeper into this issue...
I guess this resource never had 7 replicas. Not sure if this is related, but every week we run the following procedure for these resources:
ErrorReport ..24 shows: Dependency not found - The storage pool 'DfltStorPool' for resource 'one-vm-8871-disk-0' for volume number '0' is not deployed on node 'm11c17'. I guess you forgot to add a -s $myStoragePool to the r c command?
Yes I usually forget to add -s $myStoragePool
to CLI commands quite often
Now if that new r c was replacing a tiebreaker, Linstor automatically performs a toggle-disk (converting the diskless tiebreaker into the requested diskfull resource) That toggle-disk could have triggered the bug I was talking about earlier.
But we have tiebreaker turned off:
# linstor c lp | grep tie
┊ DrbdOptions/auto-add-quorum-tiebreaker ┊ false ┊
Hey,
controller error reports:
Thanks for that. It took a while, but I believe now that those NullPointerExceptions are also based on ErrorReport-5F842882-00000-015925 (which I believe is already fixed in 1.10.0)
correctly syncing DRBD:
If you are referring to the different "Allocated" sizes - you might want to check each LVM volume locally; Linstor should only forward what the corresponding `lvs` reports per satellite.
The difference in allocation size might just be a question of trimming the thin LV(s)?
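A sketch of both checks, assuming 'thindata' is also the name of the backing VG/thin pool and that the volume carries a filesystem (both are assumptions; the mount point is hypothetical):

```shell
# Compare what the local LVM reports on each satellite; Linstor only
# forwards these numbers.
lvs --units g -o lv_name,data_percent,lv_size thindata
# Trimming the filesystem releases unused blocks and lowers the reported
# thin allocation (skip this for raw/VM volumes without a filesystem).
fstrim -v /mnt/one-vm-8871-disk-0
```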
insufficient peer slots: I am not fully sure why you need to create resources from snapshots instead of backing up the snapshots themselves... Anyway, restoring a snapshot into a new resource also copies the node-id from the snapshot to the resource (as that has to match the value of the snapshotted internal metadata). That means no new node-ids are allocated when restoring snapshots into new resources, which cannot lead to the "insufficient peer slots" error. So there has to be another cause of these messages.
auto-tiebreaker: okay.. that means my assumption is wrong. Guess I have to think harder on how this could have been triggered :) Still - chances are that 1.10.0 fixes all of these issues.
Hi @ghernadi,
Thanks for that. It took a while, but I believe now that those NullPointerExceptions are also based on ErrorReport-5F842882-00000-015925 (which I believe is already fixed in 1.10.0)
OK, I'll prepare an upgrade in a while
If you are referring to the different "Allocated" sizes - you might want to check each LVM volume locally; Linstor should only forward what the corresponding lvs reports per satellite.
I always see less allocated data usage for the newly created replicas than for the old ones. I think this is because of `discard-zeroes-if-aligned yes`, which is enabled by default for thin LVM volumes, am I right?
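For reference, this is a DRBD disk-section option; a minimal drbd.conf-style fragment. My recollection is that recent DRBD 9 releases indeed default this to yes, so aligned discards are passed through and never-written blocks read back as zeroes on thin storage - please verify against the drbd.conf man page for your DRBD version:

```
resource one-vm-8871-disk-0 {
    disk {
        discard-zeroes-if-aligned yes;
    }
}
```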
I am not fully sure why you need to create resources from snapshots instead of backing up the snapshots themselves...
Because the snapshots contain the DRBD metadata, I need to create a resource from the snapshot to back up only the raw data from it. Recovery is also done by a full byte-copy on top of the DRBD device.
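A sketch of that workflow (the snapshot and target names are hypothetical, and the exact flag spellings vary between linstor client versions, so check `linstor snapshot resource restore -h` before relying on them):

```shell
# Restore the snapshot into a throwaway resource definition, then back up
# the raw data through the DRBD device rather than the backing LV (which
# would include the DRBD metadata).
linstor resource-definition create one-vm-8871-disk-0-bak
linstor snapshot resource restore \
    --fromresource one-vm-8871-disk-0 --fromsnapshot weekly-snap \
    one-vm-8871-disk-0-bak
```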
As for the StandAlone states, I fixed this by running `drbdadm disconnect one-vm-8871-disk-0 && drbdadm connect one-vm-8871-disk-0` one by one on all storage nodes:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m11c26 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 236.95 GiB ┊ Unused ┊ UpToDate ┊
┊ m13c41 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m14c42 ┊ one-vm-8871-disk-0 ┊ thindata ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m15c41 ┊ one-vm-8871-disk-0 ┊ DfltDisklessStorPool ┊ 0 ┊ 1629 ┊ /dev/drbd1629 ┊ ┊ InUse ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
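The per-node fix described above could be scripted roughly like this (a sketch; node list taken from this thread, passwordless ssh access assumed):

```shell
# Cycle the DRBD connection for the resource on each storage node in turn,
# clearing the StandAlone state.
for node in m11c26 m13c41 m14c42; do
  ssh "$node" "drbdadm disconnect one-vm-8871-disk-0 && drbdadm connect one-vm-8871-disk-0"
done
```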
Linstor-controller log: controller.log