LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
981 stars 76 forks source link

The current scope has already been entered #194

Closed ddpolyakov closed 2 years ago

ddpolyakov commented 3 years ago

Hi there. Im using linstor over Kubernetes and made a several tests with creating pvc/destroying pvc and redeploying linstor control plane. Im using etcd as a database. After one iteration my controller started to suffer from a nil pointer mistake

there is a text of error-report.

`ERROR REPORT 5FA7545B-00000-000000

============================================================

Application: LINBIT? LINSTOR Module: Controller Version: 1.9.0 Build ID: 678acd24a8b9b73a735407cd79ca33a5e95eb2e2 Build time: 2020-09-23T10:27:49+00:00 Error time: 2020-11-08 05:15:55 Node: linstor-controller-0

============================================================

Reported error:

Category: RuntimeException Class name: NullPointerException Class canonical name: java.lang.NullPointerException Generated at: Method '', Source file 'Resource.java', Line #96

Call backtrace:

Method                                   Native Class:Line number
<init>                                   N      com.linbit.linstor.core.objects.Resource:96
load                                     N      com.linbit.linstor.core.objects.ResourceDbDriver:161
load                                     N      com.linbit.linstor.core.objects.ResourceDbDriver:44
loadAll                                  N      com.linbit.linstor.dbdrivers.etcd.ETCDEngine:144
loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:100
loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:340
loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
start                                    N      com.linbit.linstor.core.Controller:337
main                                     N      com.linbit.linstor.core.Controller:556

END OF ERROR REPORT.`

and after that I can see a The current scope has already been entered error. Also controller cannot fetch any of nodes/resources from database. But they are still inside etcd. Any ideas? Or anything else I can attach to be more informational?

ddpolyakov commented 3 years ago

I have added some additional logging info inside ResourceDbDriver + errorReporter.logDebug(rscName +" " + nodeName + " " + flags); ret = new Pair<>( new Resource( raw.build(UUID, java.util.UUID::fromString), getObjectProtection(ObjectProtection.buildPath(nodeName, rscName)), loadAllDataRef.objB.get(rscName), loadAllDataRef.objA.get(nodeName),

And I can see only one reference to that (possibly it falls and breaks after) 20:43:10.953 [Main] DEBUG LINSTOR/Controller - SYSTEM - PVC-08844D17-0AA9-4507-8D15-3B5F9BC728E3 KBT-NODE11.I 0 As far as I can understand (me myself is not a java coder at all) - it fails on trying to get loadAllDataRef.objA.get(nodeName) information on kbt-node11.i node

ddpolyakov commented 3 years ago

For sure it led me to Resource.java:96 - where I found resDfnRef.getName() and thats where Controller hits the nil pointer. Still I can see all the information on resource in etcd. Ive got around 15 resources in etcd and Im afraid, because PVC-08844D17-0AA9-4507-8D15-3B5F9BC728E3 got "0" id - thath all others are broken too. Any ideas? or its better to recreate database (its still an option)?

ghernadi commented 3 years ago

Thank you for the hint that you already added a debug-line - that explains the "strange" stacktrace, as usually the Resource's constructor is called in line 157 in v1.9.0, but your stacktrace shows 161.. I was puzzled a bit until I saw you were modifying the code a bit (which is fine :) )

As you already said, the exception is thrown at the resDfnRef.getName() line. In other words the resource-definition was not found during loading, which is quite strange, should never happen. What you consider as id "0", is not an ID, but the flags the resource has as DRBD_DISKLESS, INACTIVE, although instead of storing those flags as string, we store it by single bits composed in a long variable. So "flag = 0" means, no flags set, which is completely fine. If that resource has DRBD in its layer-list, that means that resource will be a diskful replica.

Back to the actual issue:
There is unfortunately not much I can see here, why this error occurs. Would you mind send me the etcd-dump so I can take a closer look / try to reproduce? (e-mail-address is in my profile)

Apart from that, recreating the database is always an option. The Controller only stores data in that database, so if you delete it, the controller can re-start with an empty cluster.

ddpolyakov commented 3 years ago

thanks for reply. By "0" as an id I got the record in database like /LINSTOR/LAYER_RESOURCE_IDS/0/RESOURCE_NAME PVC-8FB6D2A2-99CD-4945-9B04-D9B7B6827F8E I have recreated database - so it is not an urgent issue. I will send you dump I have kept during the day.

dkhachyan commented 3 years ago

Hi! Same error with linstor 1.11.1 on kubernetes. After several create/delete resources, controller re-starts with an empty cluster. We resolved this issue by enabling dubbing on etcd cluster. We found /LINSTOR/LAYER_RESOURCE_IDS/ with inconsistent info in database and removed them.

ERROR REPORT 6012CCA4-00000-000000

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.11.1
Build ID:                           fe95a94d86c66c6c9846a3cf579a1a776f95d3f4
Build time:                         2021-01-13T08:34:55+00:00
Error time:                         2021-01-28 14:39:42
Node:                               linstor-cs-controller-766f4fd4c9-69klj

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'restoreDrbdVolumes', Source file 'DrbdLayerETCDDriver.java', Line #657

Call backtrace:

    Method                                   Native Class:Line number
    restoreDrbdVolumes                       N      com.linbit.linstor.core.objects.DrbdLayerETCDDriver:657
    load                                     N      com.linbit.linstor.core.objects.DrbdLayerETCDDriver:564
    loadLayerData                            N      com.linbit.linstor.dbdrivers.DatabaseLoader:688
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:592
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:527
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:337
    main                                     N      com.linbit.linstor.core.Controller:558

END OF ERROR REPORT.
ghernadi commented 3 years ago

Just out of curiosity: is this ErrorReport from before or after you deleted Linstor entries from ETCD?

ledzepppelin commented 3 years ago

i had the same problem a while ago (1.10 i think it was), i rolled back my etcd day-by-day, found a working one, than narrowed it down to the problematic resource. the problem appeared after "linstor rd d pvc-143b5d66-22f3-4abc-be12-37512f319a10". for some reason there were duplicate keys in etcd for that resource definition. after "rd d" they were causing nullpointer exception. so i went back to my last state, deleted those keys, and it worked.

here are prefixes i deleted from etcd. pvc actually had different id's (not 1016 and 1017) which were correctly deleted during "rd d". what hell did 1016 and 1017 came from i really don't know. /LINSTOR/LAYER_DRBD_RESOURCES/1016 /LINSTOR/LAYER_DRBD_VOLUMES/1016:0 /LINSTOR/LAYER_RESOURCE_IDS/1016 /LINSTOR/LAYER_RESOURCE_IDS/1017 /LINSTOR/LAYER_STORAGE_VOLUMES/1017:0 /LINSTOR/LAYER_RESOURCE_IDS/1016/RESOURCE_NAME PVC-143B5D66-22F3-4ABC-BE12-37512F319A10

thanks for the knowledge on etcd and linstor, was a cool problem to solve

ghernadi commented 3 years ago

I would be interested in a full ETCD dump of a broken version for further investigation. feel free to send me such a dump via email (mail in my profile). In an ideal world I would also love an ETCD dump of a working-before-"rd d"-command version, but I guess that is too much asked for :)

ledzepppelin commented 3 years ago

I would be interested in a full ETCD dump of a broken version for further investigation. feel free to send me such a dump via email (mail in my profile). In an ideal world I would also love an ETCD dump of a working-before-"rd d"-command version, but I guess that is too much asked for :)

i don't think i still have snapshots, but i'm pretty sure i have full etcd-watch, maybe broken into some parts where i tried to figure out what was going on, i'll search my files on weekend and send it to you if i'll find'em, maybe you could even find out where 1016 and 1017 came from

edit: i actually found some files on my home pc, i'll send them to you now, but it's really a mess, i don't even remember what i was doing exactly. the largest file is probably a full "etcdctl watch"-dump, the rest must be from near where the problem occured, etcd_restore.sh is a script i wrote back then which i recall was helpful

edit2: sent, "ce1" is probably the begging of "rd d"

dkhachyan commented 3 years ago

Just out of curiosity: is this ErrorReport from before or after you deleted Linstor entries from ETCD?

This report is from before. Yesterday, linstor server stated with empty database again. I sent you etcd dump and error report to email.

rp- commented 2 years ago

We had a few scope fixes, so if you see this problem again, please reopen