LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
978 stars 76 forks source link

Satellite has established a connection to a different controller #355

Open acidrop opened 1 year ago

acidrop commented 1 year ago

I've seen a strange behaviour on my test cluster recently. At random times, Controller refuses to connect to a Satellite claiming that the Satellite has established a connection to a different controller. Controller is running on 3 nodes and it's managed by drbd-reactor.

Restarting the affected Satellite node has no effect, but restarting the Controller "fixes" the problem until the next time when a random Satellite will be affected by the same issue. To me, this does not seem to be an issue on the Satellite(s) but rather on the Controller side, but I may be wrong.

Could this be a bug or something specific to this system ?

┊ Node ┊ NodeType  ┊ Addresses               ┊ State                     ┊
╞════════════════════════════════════════════════════════════════════════╡
┊ pve1 ┊ SATELLITE ┊ 10.10.13.1:3366 (PLAIN) ┊ Online                    ┊
┊ pve2 ┊ SATELLITE ┊ 10.10.13.2:3366 (PLAIN) ┊ Online                    ┊
┊ pve3 ┊ SATELLITE ┊ 10.10.13.3:3366 (PLAIN) ┊ Online                    ┊
┊ pve4 ┊ SATELLITE ┊ 10.10.13.4:3366 (PLAIN) ┊ OFFLINE(OTHER_CONTROLLER) ┊
┊ pve5 ┊ SATELLITE ┊ 10.10.13.5:3366 (PLAIN) ┊ Online                    ┊
╰───────────────────────────────────────────────────
Controller
—————————-

-- Logs begin at Tue 2021-11-16 17:31:25 UTC, end at Thu 2023-04-27 16:23:11 UTC. --
Apr 26 06:39:17 linctrl02 systemd[1]: Starting drbd-reactor controlled linstor-controller...
Apr 26 06:39:21 linctrl02 Controller[16524]: LINSTOR, Module Controller
Apr 26 06:39:22 linctrl02 Controller[16524]: Version:            1.22.0 (0dd2e7d8bad7b0115e924ef66371568320898285)
Apr 26 06:39:22 linctrl02 Controller[16524]: Build time:         2023-04-17T12:43:16+00:00
Apr 26 06:39:22 linctrl02 Controller[16524]: Java Version:       11
Apr 26 06:39:22 linctrl02 Controller[16524]: Java VM:            Ubuntu, Version 11.0.18+10-post-Ubuntu-0ubuntu120.04.1
Apr 26 06:39:22 linctrl02 Controller[16524]: Operating system:   Linux, Version 5.4.0-146-generic
Apr 26 06:39:22 linctrl02 Controller[16524]: Environment:        amd64, 2 processors, 121 MiB memory reserved for allocations
Apr 26 06:39:22 linctrl02 Controller[16524]: System components initialization in progress
Apr 26 06:39:24 linctrl02 Controller[16524]: Loading configuration file "/etc/linstor/linstor.toml"
Apr 26 06:39:27 linctrl02 Controller[16524]: 06:39:27.914 [main] INFO  LINSTOR/Controller - SYSTEM - ErrorReporter DB version 1 found.
Apr 26 06:39:27 linctrl02 Controller[16524]: 06:39:27.918 [main] INFO  LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller'
Apr 26 06:39:28 linctrl02 Controller[16524]: 06:39:28.151 [main] INFO  LINSTOR/Controller - SYSTEM - Database type is SQL
Apr 26 06:39:28 linctrl02 Controller[16524]: 06:39:28.154 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading API classes started.
Apr 26 06:39:29 linctrl02 Controller[16524]: 06:39:29.967 [Main] INFO  LINSTOR/Controller - SYSTEM - API classes loading finished: 1812ms
Apr 26 06:39:29 linctrl02 Controller[16524]: 06:39:29.968 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection started.
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.089 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.091 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.092 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.167 [Main] INFO  LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.168 [Main] INFO  LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.169 [Main] INFO  LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" is not installed
Apr 26 06:39:34 linctrl02 Controller[16524]: 06:39:34.010 [Main] INFO  LINSTOR/Controller - SYSTEM - Dependency injection finished: 4042ms
Apr 26 06:39:34 linctrl02 Controller[16524]: 06:39:34.011 [Main] INFO  LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module
Apr 26 06:39:35 linctrl02 Controller[16524]: 06:39:35.143 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.890 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.898 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing the database connection pool
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.911 [Main] INFO  LINSTOR/Controller - SYSTEM - SQL database connection URL is "jdbc:h2:/var/lib/linstor/linstordb"
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.287 [Main] INFO  LINSTOR/Controller - SYSTEM - SQL database is H2
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.331 [Main] WARN  org.flywaydb.core.api.Location - Use of dots (.) as path separators will be deprecated in Flyway 7. Path: com.linbit.linstor.dbcp.migration
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.354 [Main] INFO  org.flywaydb.core.internal.license.VersionPrinter - Flyway Community Edition 6.5.7 by Redgate
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.525 [Main] INFO  org.flywaydb.core.internal.database.DatabaseFactory - Database: jdbc:h2:/var/lib/linstor/linstordb (H2 1.4)
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.785 [Main] INFO  org.flywaydb.core.internal.command.DbValidate - Successfully validated 69 migrations (execution time 00:00.168s)
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.905 [Main] INFO  org.flywaydb.core.internal.command.DbMigrate - Current version of schema "LINSTOR": 2023.03.22.10.00
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.908 [Main] WARN  org.flywaydb.core.internal.command.DbMigrate - outOfOrder mode is active. Migration of schema "LINSTOR" may not be reproducible.
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.953 [Main] INFO  org.flywaydb.core.internal.command.DbMigrate - Schema "LINSTOR" is up to date. No migration necessary.
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.968 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'DatabaseService' of type DatabaseService
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.968 [Main] INFO  LINSTOR/Controller - SYSTEM - Loading security objects
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.015 [Main] INFO  LINSTOR/Controller - SYSTEM - Current security level is NO_SECURITY
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.064 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.995 [Main] INFO  LINSTOR/Controller - SYSTEM - Core objects load from database completed
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.004 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'TaskScheduleService' of type TaskScheduleService
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.007 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing network communications services
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.010 [Main] WARN  LINSTOR/Controller - SYSTEM - The SSL network communication service 'DebugSslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.088 [Main] INFO  LINSTOR/Controller - SYSTEM - Created network communication service 'PlainConnector'
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.089 [Main] WARN  LINSTOR/Controller - SYSTEM - The SSL network communication service 'SslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.102 [Main] INFO  LINSTOR/Controller - SYSTEM - Created network communication service 'SslConnector'
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.105 [Main] INFO  LINSTOR/Controller - SYSTEM - Reconnecting to previously known nodes
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.282 [Main] INFO  LINSTOR/Controller - SYSTEM - Reconnect requests sent
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.289 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'ScheduleBackupService' of type ScheduleBackupService
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.295 [Main] INFO  LINSTOR/Controller - SYSTEM - Starting service instance 'EbsStatusPoll' of type EbsStatusPoll
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: An illegal reflective access operation has occurred
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Illegal reflective access by com.sun.xml.bind.v2.runtime.reflect.opt.Injector (file:/usr/share/linstor-server/lib/jaxb-impl-2.2.11.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int)
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Please consider reporting this to the maintainers of com.sun.xml.bind.v2.runtime.reflect.opt.Injector
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: All illegal access operations will be denied in a future release
Apr 26 06:39:44 linctrl02 Controller[16524]: Apr 26, 2023 6:39:44 AM org.glassfish.grizzly.http.server.NetworkListener start
Apr 26 06:39:44 linctrl02 Controller[16524]: INFO: Started listener bound to [[::]:3370]
Apr 26 06:39:44 linctrl02 Controller[16524]: Apr 26, 2023 6:39:44 AM org.glassfish.grizzly.http.server.HttpServer start
Apr 26 06:39:44 linctrl02 Controller[16524]: INFO: [HttpServer] Started.
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.352 [Main] INFO  LINSTOR/Controller - SYSTEM - Controller initialized
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.646 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.787 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
Apr 26 06:39:45 linctrl02 systemd[1]: Started drbd-reactor controlled linstor-controller.
Apr 27 06:23:36 linctrl02 Controller[16524]: 06:23:36.413 [MainWorkerPool-2] WARN  LINSTOR/Controller - SYSTEM - Satellite pve4 has established a connection to a different controller
Apr 27 06:39:39 linctrl02 Controller[16524]: 06:39:39.729 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
Apr 27 06:39:40 linctrl02 Controller[16524]: 06:39:40.720 [TaskScheduleService] INFO  LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
Apr 27 14:00:15 linctrl02 Controller[16524]: 14:00:15.196 [grizzly-http-server-2] ERROR LINSTOR/Controller - SYSTEM - No active connection to satellite 'pve4'. [Report number 6448C71C-00000-000000]
Apr 27 16:00:19 linctrl02 Controller[16524]: 16:00:19.655 [grizzly-http-server-3] ERROR LINSTOR/Controller - SYSTEM - No active connection to satellite 'pve4'. [Report number 6448C71C-00000-000001]
Satellite
———————

Apr 27 17:44:54 pve4 Satellite[1624881]: 17:44:54.243 [Thread-32] INFO  LINSTOR/Satellite - SYSTEM - Shutdown complete
Apr 27 17:44:56 pve4 systemd[1]: linstor-satellite.service: Succeeded.
Apr 27 17:44:56 pve4 systemd[1]: Stopped LINSTOR Satellite Service.
Apr 27 17:44:56 pve4 systemd[1]: linstor-satellite.service: Consumed 5h 20min 24.288s CPU time.
Apr 27 17:44:56 pve4 systemd[1]: Started LINSTOR Satellite Service.
Apr 27 17:45:06 pve4 Satellite[1051398]: LINSTOR, Module Satellite
Apr 27 17:45:11 pve4 Satellite[1051398]: Version:            1.22.0 (0dd2e7d8bad7b0115e924ef66371568320898285)
Apr 27 17:45:11 pve4 Satellite[1051398]: Build time:         2023-04-17T11:24:18+00:00
Apr 27 17:45:11 pve4 Satellite[1051398]: Java Version:       11
Apr 27 17:45:11 pve4 Satellite[1051398]: Java VM:            Debian, Version 11.0.18+10-post-Debian-1deb11u1
Apr 27 17:45:11 pve4 Satellite[1051398]: Operating system:   Linux, Version 5.15.35-1-pve
Apr 27 17:45:11 pve4 Satellite[1051398]: Environment:        amd64, 8 processors, 1992 MiB memory reserved for allocations
Apr 27 17:45:11 pve4 Satellite[1051398]: System components initialization in progress
Apr 27 17:45:35 pve4 Satellite[1051398]: 17:45:35.517 [main] INFO  LINSTOR/Satellite - SYSTEM - ErrorReporter DB version 1 found.
Apr 27 17:45:35 pve4 Satellite[1051398]: 17:45:35.520 [main] INFO  LINSTOR/Satellite - SYSTEM - Log directory set to: '/var/log/linstor-satelli
te'
Apr 27 17:45:36 pve4 Satellite[1051398]: 17:45:36.961 [Main] INFO  LINSTOR/Satellite - SYSTEM - Loading API classes started.
Apr 27 17:45:38 pve4 Satellite[1051398]: 17:45:38.935 [Main] INFO  LINSTOR/Satellite - SYSTEM - API classes loading finished: 1974ms
Apr 27 17:45:38 pve4 Satellite[1051398]: 17:45:38.936 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection started.
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.434 [Main] INFO  LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "co
m.linbit.linstor.modularcrypto.FipsCryptoModule"
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.435 [Main] INFO  LINSTOR/Satellite - SYSTEM - Extension module "com.linbit.linstor.modularcry
pto.FipsCryptoModule" is not installed
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.435 [Main] INFO  LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "co
m.linbit.linstor.modularcrypto.JclCryptoModule"
Apr 27 17:45:40 pve4 Satellite[1051398]: 17:45:40.123 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dynamic load of extension module "com.linbit.li
nstor.modularcrypto.JclCryptoModule" was successful
Apr 27 17:45:49 pve4 Satellite[1051398]: 17:45:49.721 [Main] INFO  LINSTOR/Satellite - SYSTEM - Dependency injection finished: 10784ms
Apr 27 17:45:49 pve4 Satellite[1051398]: 17:45:49.722 [Main] INFO  LINSTOR/Satellite - SYSTEM - Cryptography provider: Using default cryptograp
hy module
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.203 [Main] INFO  LINSTOR/Satellite - SYSTEM - Initializing main network communications servic
e
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.203 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'TimerEventService' o
f type TimerEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.204 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'FileEventService' of
 type FileEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.204 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventService-1' 
of type DrbdEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventPublisher-1
' of type DrbdEventPublisher
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'SnapshotShippingServ
ice' of type SnapshotShippingService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingS3Serv
ice' of type BackupShippingS3Service
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.208 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingL2LSer
vice' of type BackupShippingL2LService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.208 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'DeviceManager' of ty
pe DeviceManager
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.209 [Main] INFO  LINSTOR/Satellite - SYSTEM - Starting service instance 'CloneService' of typ
e CloneService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.965 [Main] INFO  LINSTOR/Satellite - SYSTEM - NetComService started on port /0:0:0:0:0:0:0:0:
3366

Package versions:

Satellite
---------
ii  linstor-client                       1.18.0-1                       all          Linstor client command line tool
ii  linstor-common                       1.22.0-1                       all          DRBD distributed resource management utility
ii  linstor-proxmox                      7.0.0-1                        all          DRBD distributed resource management utility
ii  linstor-satellite                    1.22.0-1                       all          DRBD distributed resource management utility
ii  python-linstor                       1.18.0-1                       all          Linstor python api library

cat /proc/drbd
version: 9.2.3 (api:2/proto:86-122
Controller
----------
ii  linstor-client                         1.18.0-1ppa1~focal1               all          Linstor client command line tool
ii  linstor-common                         1.22.0-1ppa1~focal1               all          DRBD distributed resource management utility
ii  linstor-controller                     1.22.0-1ppa1~focal1               all          DRBD distributed resource management utility
ii  python-linstor                         1.18.0-1ppa1~focal1               all          Linstor python api library
cat /proc/drbd
version: 9.2.3 (api:2/proto:86-122)
Mimikoo commented 1 year ago

I also do facing same problem. With two different controller on two different servers are installed, satellites connect to another controller if the main controller is restarting. But this is not primary-secondary controller setup. It is like satellites do a sort of dhcp to find a controller to connect.

rp- commented 1 year ago

There should only be one active(started) controller at a time. It is not supported having multiple controller running at the same time using the same satellite nodes.

rck commented 1 year ago

yes Mimikoo's setup looks just wrong, but acidrop uses drbd-reactor, there shouldn't be two instances of the controller active at a time.

rck commented 1 year ago

some TCP timeout thingy where this races? where the old controller is gone and the new started but the satellite did not see that the old is already gone? basically why some setups use some kind of "portblock" magic?

Mimikoo commented 1 year ago

Just realized that not satellites are connecting to controllers, but controllers are connecting to satellites. My setup was a old controller that where in running state with nodes defined. And the new controller was restarted.

There must be some configurations for satellites to define a password or some connection authentication so a malicious controller would not get nodes and break the informaion.

acidrop commented 1 year ago

Looks like this is reproducible only when the nodes are under heavy load (drbd-reactor and the linstor controller in my use case is running within VMs).