LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

experimental Kubernetes API as internal state store - can't load resources after controller restart #263

Closed msheldyakov closed 2 years ago

msheldyakov commented 2 years ago

After several days of test use, a controller restart left the controller unable to load resources. Installation was done via piraeus v1.7.0-rc.2.

Controller log:

16:15:18.168 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
16:15:38.813 [Main] ERROR LINSTOR/Controller - SYSTEM - Problem of type 'java.lang.NullPointerException' logged to report number 619BC20D-00000-000000

linstor err show 619BC20D-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.16.0
Build ID:                           4d8a85cf23554eaf6a65c4f5b56fa62cf8b285eb
Build time:                         2021-11-11T12:47:51+00:00
Error time:                         2021-11-22 16:15:38
Node:                               linstor-piraeus-cs-controller-b48d46b84-jqfcf

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'restoreDrbdVolumes', Source file 'DrbdLayerK8sCrdDriver.java', Line #606

Call backtrace:

    Method                                   Native Class:Line number
    restoreDrbdVolumes                       N      com.linbit.linstor.core.objects.DrbdLayerK8sCrdDriver:606
    load                                     N      com.linbit.linstor.core.objects.DrbdLayerK8sCrdDriver:554
    loadLayerData                            N      com.linbit.linstor.dbdrivers.DatabaseLoader:768
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:666
    loadAll                                  N      com.linbit.linstor.dbdrivers.DatabaseLoader:584
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:176
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:108
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:345
    main                                     N      com.linbit.linstor.core.Controller:583

END OF ERROR REPORT.

ghernadi commented 2 years ago

From the stacktrace, the error is about a missing entry in the database. Linstor tries to load the resources with their volumes but does not find the expected DB entries in the corresponding DRBD CRD.

If you send me a database dump (either manually export all Linstor CRDs or use linstor sos-report download and send me the file) I might be able to tell more.
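For the manual export route, a minimal sketch could look like the following. It assumes that `kubectl` is on the PATH and that the LINSTOR CRDs share the API group suffix `.internal.linstor.linbit.com`; that suffix is an assumption and should be checked against `kubectl get crds` on the cluster. The helper names `filter_linstor_crds` and `dump_linstor_crds` are hypothetical.

```python
import subprocess

# Assumption: LINSTOR's internal CRDs share this API group suffix.
LINSTOR_GROUP_SUFFIX = ".internal.linstor.linbit.com"


def filter_linstor_crds(crd_names, suffix=LINSTOR_GROUP_SUFFIX):
    """Return only the CRD names that belong to the assumed LINSTOR group."""
    cleaned = []
    for name in crd_names:
        # `kubectl get crds -o name` prefixes each entry with the resource
        # type and a slash; keep only the part after the last slash.
        name = name.strip().split("/")[-1]
        if name.endswith(suffix):
            cleaned.append(name)
    return sorted(cleaned)


def dump_linstor_crds(out_path="linstor-crds.yaml"):
    """Write the objects of every matching CRD into one YAML file."""
    listing = subprocess.run(
        ["kubectl", "get", "crds", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()
    with open(out_path, "w") as out:
        for crd in filter_linstor_crds(listing):
            # Fetch all custom resources of this kind as YAML.
            result = subprocess.run(
                ["kubectl", "get", crd, "-o", "yaml"],
                check=True, capture_output=True, text=True,
            )
            out.write("---\n" + result.stdout)
    return out_path
```

Calling `dump_linstor_crds()` against a live cluster would produce a single file that can be attached to the issue. The `linstor sos-report download` route mentioned above remains the simpler option when the controller is reachable.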

Otherwise it would certainly help to know what happened before the reboot. Any issues, any strange behavior during resource creation / deletion?

msheldyakov commented 2 years ago

Here is a backup of the CRD resources https://drive.google.com/file/d/1x-vQjtgbJfvdCVrxm7zs_L_-zUfTl8zE/view?usp=sharing

> Any issues, any strange behavior during resource creation / deletion?

I can't single out one problem; this is a test bench where we were checking for failures. Next time I will write down a detailed log of actions.

msheldyakov commented 2 years ago

My bad, I have shared the file now. https://drive.google.com/file/d/1x-vQjtgbJfvdCVrxm7zs_L_-zUfTl8zE/view?usp=sharing

ghernadi commented 2 years ago

I have not had time to experiment, but is it possible that you once forgot to create volume-definitions for a resource (namely PVC-3DCE9E04-5B7F-4CC1-AFCB-B466D03524DC)?

If so, then I might understand the issue without having dug deeper into it. If not, all I can say right now is that the database contains entries showing that the resource exists and that the resource has DrbdRscData (which also exists), but those DrbdRscData should have DrbdVlmData entries, which do not exist (that is what the error message complains about).
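The inconsistency described above can be illustrated with a small check over exported data. This is only a sketch under a simplified, hypothetical data model (plain tuples and dicts, not LINSTOR's actual CRD schema), and it assumes that every DrbdRscData entry should have one DrbdVlmData entry per volume definition of its resource. The function name and the sample values are made up for illustration.

```python
def find_missing_drbd_volumes(volume_definitions, drbd_rsc_data, drbd_vlm_data):
    """List (resource, node, volume_number) triples that lack DrbdVlmData.

    Hypothetical, simplified shapes:
      volume_definitions: {resource_name: [volume_number, ...]}
      drbd_rsc_data:      iterable of (resource_name, node_name)
      drbd_vlm_data:      iterable of (resource_name, node_name, volume_number)
    """
    present = set(drbd_vlm_data)
    missing = []
    for rsc_name, node_name in drbd_rsc_data:
        # Assumption: one DrbdVlmData entry is expected per volume definition.
        for vlm_nr in volume_definitions.get(rsc_name, []):
            key = (rsc_name, node_name, vlm_nr)
            if key not in present:
                missing.append(key)
    return sorted(missing)


# Made-up sample: "rsc-a" is deployed on two nodes, but only node-1 has
# a DrbdVlmData entry for volume 0.
missing = find_missing_drbd_volumes(
    {"rsc-a": [0]},
    [("rsc-a", "node-1"), ("rsc-a", "node-2")],
    [("rsc-a", "node-1", 0)],
)
print(missing)  # [('rsc-a', 'node-2', 0)]
```

A non-empty result would correspond to the situation reported here: resource-level DRBD data exists, while the volume-level entries the loader expects are absent.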

Usually we do not recommend modifying the database manually, but I have to understand the actual problem better before I can come up with a proper solution (without having to ask you to tamper with the database manually).

msheldyakov commented 2 years ago

> but is it possible that you once forgot to create volume-definitions for a resource?

I did not create any volume definitions manually. Everything was created automatically with the default piraeus-operator setup, via linstor-csi.

> Usually we do not recommend modifying the database manually

This is a test cluster set up specifically to test k8s as a LINSTOR store. Data loss is not a problem and cluster recovery is not required. My only intention is to file bug reports to help improve LINSTOR as a product.

If the right place for feedback on this is the Piraeus operator repository - please let me know.

ghernadi commented 2 years ago

Thank you for the information. However, right now I would need more information to continue investigating. Ideally there would be some kind of reproducer, but I do not assume you have one, or the time to find one. I will try to keep this on my radar, but I cannot promise anything at the moment.