LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

Can not list resources: java.lang.NullPointerException #195

Closed · kvaps closed this issue 2 years ago

kvaps commented 3 years ago
# linstor r c m11c17 one-vm-8871-disk-0 -s thindata
SUCCESS:
    Successfully set property key(s): StorPoolName
SUCCESS:
Description:
    New resource 'one-vm-8871-disk-0' on node 'm11c17' registered.
Details:
    Resource 'one-vm-8871-disk-0' on node 'm11c17' UUID is: f0a45ef8-f80c-478d-b2d7-da01db587d33
SUCCESS:
Description:
    Volume with number '0' on resource 'one-vm-8871-disk-0' on node 'm11c17' successfully registered
Details:
    Volume UUID is: 8f3dbb2e-b164-4a5e-80b6-aef835530551
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm15c41'
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm13c41'
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm14c42'
SUCCESS:
    Created resource 'one-vm-8871-disk-0' on 'm11c17'
ERROR:
    Resource did not became ready on node 'm11c17' within reasonable time, check Satellite for errors.
# linstor r l -r one-vm-8871-disk-0
ERROR:
Show reports:
    linstor error-reports show 5F842882-00000-015931
linstor error-reports show 5F842882-00000-015931
ERROR REPORT 5F842882-00000-015931

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.9.0
Build ID:                           678acd24a8b9b73a735407cd79ca33a5e95eb2e2
Build time:                         2020-10-03T22:33:15+00:00
Error time:                         2020-11-10 14:43:10
Node:                               linstor-controller-0

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       <UNKNOWN>

Call backtrace:

    Method                                   Native Class:Line number

END OF ERROR REPORT.

Linstor-controller log: controller.log
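For following up the "check Satellite for errors" hint above, a minimal sketch (assuming the standard `linstor-satellite` systemd unit name from the LINBIT packages; the time window is just the one from this report):

# list the error reports the controller knows about
linstor error-reports list
# on the affected satellite node, check the service log around the failure time
journalctl -u linstor-satellite --since "2020-11-10 14:30" --until "2020-11-10 15:00"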

ghernadi commented 3 years ago

Can you please also attach error reports 5F842882-00000-015924 as well as 5F842882-00000-015925?

I have not often seen an error report like the one you are showing here... quite strange..

However, just wondering - are you aware of these error reports?

18:35:54.935 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - Resource on node m11c11 has insufficient peer slots to add another peer [Report number 5F842882-00000-015917]
18:43:57.915 [MainWorkerPool-1] ERROR LINSTOR/Controller - SYSTEM - Resource on node m11c11 has insufficient peer slots to add another peer [Report number 5F842882-00000-015918]

If I am not mistaken, the default peer-count is 7... do you have more than 7 resources in one resource-definition?
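As a quick sanity check for the peer-slot question, counting the replicas of the resource-definition could look roughly like this (a sketch; `--resources` is assumed to be the long form of the `-r` filter used elsewhere in this thread):

# count how many resources currently exist in this resource-definition
linstor resource list --resources one-vm-8871-disk-0 | grep -c one-vm-8871-disk-0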

kvaps commented 3 years ago

After a restart, the linstor-controller started listing the resources again.

# linstor r l -r one-vm-8871-disk-0
╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node   ┊ Port  ┊ Usage  ┊ Conns              ┊    State ┊ CreatedOn           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ StandAlone(m14c42) ┊ UpToDate ┊ 2020-11-10 14:38:08 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok                 ┊ UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c17) ┊ UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse  ┊ Ok                 ┊ Diskless ┊ 2020-11-09 06:35:42 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor r d m11c17 one-vm-8871-disk-0
SUCCESS:
Description:
    Node: m11c17, Resource: one-vm-8871-disk-0 marked for deletion.
Details:
    Node: m11c17, Resource: one-vm-8871-disk-0 UUID is: f0a45ef8-f80c-478d-b2d7-da01db587d33
SUCCESS:
    Notified 'm15c41' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
    Notified 'm14c42' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
    Deleted 'one-vm-8871-disk-0' on 'm11c17'
SUCCESS:
    Notified 'm13c41' that 'one-vm-8871-disk-0' is being deleted on Node(s): [m11c17]
SUCCESS:
Description:
    Node: m11c17, Resource: one-vm-8871-disk-0 deletion complete.
Details:
    Node: m11c17, Resource: one-vm-8871-disk-0 UUID was: f0a45ef8-f80c-478d-b2d7-da01db587d33

# linstor r c m11c17 one-vm-8871-disk-0 -s thindata
SUCCESS:
    Successfully set property key(s): StorPoolName
SUCCESS:
Description:
    New resource 'one-vm-8871-disk-0' on node 'm11c17' registered.
Details:
    Resource 'one-vm-8871-disk-0' on node 'm11c17' UUID is: 51971b9d-43eb-4c09-8153-9d7390146916
SUCCESS:
Description:
    Volume with number '0' on resource 'one-vm-8871-disk-0' on node 'm11c17' successfully registered
Details:
    Volume UUID is: 1723b233-db22-426f-8df6-9c8b1f699b52
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm15c41'
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm14c42'
SUCCESS:
    Added peer(s) 'm11c17' to resource 'one-vm-8871-disk-0' on 'm13c41'
SUCCESS:
    Created resource 'one-vm-8871-disk-0' on 'm11c17'
ERROR:
    Resource did not became ready on node 'm11c17' within reasonable time, check Satellite for errors.

# linstor r l -r one-vm-8871-disk-0
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node   ┊ Port  ┊ Usage  ┊ Conns              ┊             State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ SyncTarget(2.14%) ┊ 2020-11-10 15:45:22 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok                 ┊          UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c17) ┊          UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse  ┊ Ok                 ┊          Diskless ┊ 2020-11-09 06:35:42 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
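To see the same StandAlone/Connecting states from the DRBD side, the connection state can be checked directly on the satellites; a minimal sketch with the node-local DRBD 9 tooling:

# run on m11c17 and m14c42: show the DRBD view of the resource and its peer connections
drbdadm status one-vm-8871-disk-0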

> Can you please also attach error reports 5F842882-00000-015924 as well as 5F842882-00000-015925?

sure: ErrorReport-5F842882-00000-015917.log ErrorReport-5F842882-00000-015918.log

ghernadi commented 3 years ago

uhm... okay... Not sure how to say it, as I really don't want to be mean... but the first one leads to 404, and although the second one works, I asked for ..24 and ..25, not ..17 and ..18 :)

kvaps commented 3 years ago

Ahh sorry, I had a really nervous day today :)

ErrorReport-5F842882-00000-015924.log ErrorReport-5F842882-00000-015925.log

ghernadi commented 3 years ago

Thanks :)

So, after some brief research, here are my findings:

> linstor r l -r one-vm-8871-disk-0
>
> ... one-vm-8871-disk-0 ┊ m11c17 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ SyncTarget(2.14%) ┊ 2020-11-10 15:45:22



I am not fully sure about what happened here, but as the `r l` output shows, the resource is already syncing. Maybe a networking issue? Some timeout, and Linstor just gave up waiting too early?

* The maybe unrelated ErrorReport 015918, regarding `Resource on node m11c11 has insufficient peer slots to add another peer`
Did you check if you have more than 7 resources? If you did not have more than 7, we might need to dig deeper into this issue...

* ErrorReport ..24 shows: `Dependency not found` - `The storage pool 'DfltStorPool' for resource 'one-vm-8871-disk-0' for volume number '0' is not deployed on node 'm11c17'.`
I guess you forgot to add a `-s $myStoragePool` to the `r c` command?

* ErrorReport ..25: I have seen this bug, and I believe it is fixed with the recently released 1.10.0. The only time I saw this bug was during a toggle-disk.
My very wild guess: as there are ~5 seconds between ErrorReports ..24 and ..25, and if my previous guess is correct that you just forgot to add `-s $myStoragePool` to your `r c` command, I would strongly suspect that you also realized that mistake and fixed it quite quickly... maybe within a few seconds? Now if that new `r c` was replacing a tiebreaker, Linstor automatically performs a toggle-disk (converting the diskless tiebreaker into the requested diskful resource).
That toggle-disk could have triggered the bug I was talking about earlier.
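For reference, such a toggle-disk can also be triggered manually; a minimal sketch, assuming the `resource toggle-disk` subcommand is available in this LINSTOR version (node and storage-pool names taken from this thread):

# convert the diskless resource on m11c17 into a diskful one backed by 'thindata'
linstor resource toggle-disk m11c17 one-vm-8871-disk-0 --storage-pool thindata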
ghernadi commented 3 years ago

Small update regarding the missing stacktrace (maybe someone finds this interesting): by accident I also just ran into an endless loop of NullPointerExceptions and saw that the first 6148 ErrorReports did contain the full stacktrace, whereas the following ones showed no stacktrace, just like in the original post of this issue.

I would not assume that this count is the same on every machine; many factors could influence the precise threshold...
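A side note on this JVM behaviour: HotSpot omits the stacktraces of frequently thrown exceptions as an optimization, and this can be switched off when full traces are needed for debugging. A minimal sketch, assuming the controller's JVM options can be extended via its service environment (the variable name is illustrative, not necessarily what the LINBIT packaging uses):

# keep full stacktraces even for frequently thrown ("hot") exceptions
JAVA_OPTS="$JAVA_OPTS -XX:-OmitStackTraceInFastThrow"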

kvaps commented 3 years ago

Hi @ghernadi,

> The weird NullPointerException with no stacktrace: when an Exception is thrown too often, the JVM apparently is allowed to omit reconstructing the stacktrace for better performance (since JDK 5). That means I'd like to ask you to check your logs and try to find one of the first ErrorReports with a corresponding NullPointerException. That ErrorReport should contain the full stacktrace, which might help us identify the actual issue.

Well, I can provide you with all the controller reports collected for this session.

> I am not fully sure about what happened here, but as the `r l` output shows, the resource is already syncing. Maybe a networking issue? Some timeout, and Linstor just gave up waiting too early?

The resource was syncing but didn't finish correctly:

# linstor r l -r one-vm-8871-disk-0
╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName       ┊ Node   ┊ Port  ┊ Usage  ┊ Conns              ┊    State ┊ CreatedOn           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ one-vm-8871-disk-0 ┊ m11c26 ┊ 55478 ┊ Unused ┊ Connecting(m14c42) ┊ UpToDate ┊ 2020-11-10 18:43:37 ┊
┊ one-vm-8871-disk-0 ┊ m13c41 ┊ 55478 ┊ Unused ┊ Ok                 ┊ UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m14c42 ┊ 55478 ┊ Unused ┊ StandAlone(m11c26) ┊ UpToDate ┊                     ┊
┊ one-vm-8871-disk-0 ┊ m15c41 ┊ 55478 ┊ InUse  ┊ Ok                 ┊ Diskless ┊ 2020-11-09 06:35:42 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯

# linstor v l -r one-vm-8871-disk-0
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node   ┊ Resource           ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊  Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m11c26 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 236.95 GiB ┊ Unused ┊ UpToDate ┊
┊ m13c41 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m14c42 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m15c41 ┊ one-vm-8871-disk-0 ┊ DfltDisklessStorPool ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊            ┊ InUse  ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I'm going to investigate this; so far I have collected the logs from the nodes:

> The maybe unrelated ErrorReport 015918, regarding `Resource on node m11c11 has insufficient peer slots to add another peer`: Did you check if you have more than 7 resources? If you did not have more than 7, we might need to dig deeper into this issue...

I guess this resource never had 7 replicas; not sure if this is related, but every week we run the following procedure for these resources:

> ErrorReport ..24 shows: `Dependency not found` - `The storage pool 'DfltStorPool' for resource 'one-vm-8871-disk-0' for volume number '0' is not deployed on node 'm11c17'.` I guess you forgot to add a `-s $myStoragePool` to the `r c` command?

Yes, I forget to add `-s $myStoragePool` to CLI commands quite often.
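One way to avoid the fallback to 'DfltStorPool' when `-s` is forgotten is to set the `StorPoolName` property (the same key the `r c` output above reports) once on the resource-definition; a minimal sketch, assuming the usual LINSTOR property inheritance applies here:

# let new volumes of this resource-definition default to the 'thindata' pool
linstor resource-definition set-property one-vm-8871-disk-0 StorPoolName thindata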

> Now if that new `r c` was replacing a tiebreaker, Linstor automatically performs a toggle-disk (converting the diskless tiebreaker into the requested diskful resource). That toggle-disk could have triggered the bug I was talking about earlier.

But we have the tiebreaker turned off:

# linstor c lp | grep tie
┊ DrbdOptions/auto-add-quorum-tiebreaker    ┊ false                ┊
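For completeness, that property can be inspected and toggled at the controller level; a minimal sketch using the property key shown above:

# show the current controller-level value
linstor controller list-properties | grep auto-add-quorum-tiebreaker
# re-enable (or disable) the automatic tiebreaker cluster-wide
linstor controller set-property DrbdOptions/auto-add-quorum-tiebreaker true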
ghernadi commented 3 years ago

Hey,

kvaps commented 3 years ago

Hi @ghernadi,

> Thanks for that. It took a while, but I believe now that those NullPointerExceptions are also based on ErrorReport-5F842882-00000-015925 (which I believe is already fixed in 1.10.0).

OK, I'll prepare an upgrade in a while

> If you are referring to the different "Allocated" sizes - you might want to check every LV locally; Linstor should only forward what the corresponding `lvs` gives per satellite.

I always see less allocated data usage for the newly created replicas than for the old ones. I think this is because of `discard-zeroes-if-aligned yes`, which is enabled by default for the thin LVM volumes, am I right?
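Checking what the satellites actually report can be done locally with plain LVM tooling; a minimal sketch, with the volume group name as a placeholder since it depends on the local storage-pool layout:

# show size and thin-pool data usage of the backing LVs (replace <vg_name> with your VG)
lvs -o lv_name,lv_size,data_percent <vg_name>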

> I am not fully sure why you need to create resources from snapshots instead of backing up the snapshots themselves...

Because the snapshots contain the DRBD metadata, I need to create a resource from the snapshot to back up only the raw data from it. Recovery is also done by a full byte-copy on top of the DRBD device.
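The recovery step described here (a full byte-copy back onto the DRBD device) could look roughly like the following sketch; the backup file name is illustrative, and writing goes through the DRBD device so the peers stay in sync:

# restore a raw backup by writing it through the DRBD device (file name is an example)
dd if=/backup/one-vm-8871-disk-0.raw of=/dev/drbd1629 bs=4M status=progress conv=fsync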

kvaps commented 3 years ago

As for the StandAlone states, I fixed them by running `drbdadm disconnect one-vm-8871-disk-0 && drbdadm connect one-vm-8871-disk-0` one by one on all storage nodes:

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node   ┊ Resource           ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊  Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ m11c26 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 236.95 GiB ┊ Unused ┊ UpToDate ┊
┊ m13c41 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m14c42 ┊ one-vm-8871-disk-0 ┊ thindata             ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊ 250.05 GiB ┊ Unused ┊ UpToDate ┊
┊ m15c41 ┊ one-vm-8871-disk-0 ┊ DfltDisklessStorPool ┊     0 ┊    1629 ┊ /dev/drbd1629 ┊            ┊ InUse  ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
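For reference, the disconnect/connect cycle described above, run node by node, could be scripted roughly like this (node names taken from the listing above, ssh access to the storage nodes assumed):

# re-establish the DRBD connections one node at a time
for node in m11c26 m13c41 m14c42; do
    ssh "$node" "drbdadm disconnect one-vm-8871-disk-0 && drbdadm connect one-vm-8871-disk-0"
done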