ControlSystemStudio / cs-studio

Control System Studio is an Eclipse-based collections of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
https://controlsystemstudio.org/
Eclipse Public License 1.0
113 stars 96 forks source link

Connection problems with BOY #2300

Open claudio-rosati opened 6 years ago

claudio-rosati commented 6 years ago

One of my users has the following problem.

After restarting the IOC my 4.5.1.0 (CR: the one built over the current master branch) Linux CSS was unable to reconnect.

These messages appear in CSS Console:

2017-10-10 10:16:31 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:34 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:34 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:34 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:16:37 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:41 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:47 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:49 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:50 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:53 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:54 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:56 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:16:56 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:56 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:16:59 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:02 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:02 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:03 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:09 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:12 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:18 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:18 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:18 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:19 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:23 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:27 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:27 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:27 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:27 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:28 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:30 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:30 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:30 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:31 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:31 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:33 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:33 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:34 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:37 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:37 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: No channel transport available, channel disconnected.
2017-10-10 10:17:39 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:40 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:41 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:42 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:42 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:43 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:45 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:48 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:49 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:50 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:51 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:52 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:53 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:17:54 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:55 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:17:55 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: transport closed
2017-10-10 10:18:00 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.
2017-10-10 10:18:01 ERROR: Error from pv connection layer: 
java.lang.IllegalStateException: Channel not connected.

On IOC side these messages appear:

CAS: forcing disconnect from 194.47.243.119:35340
CAS: forcing disconnect from 194.47.243.119:35342
CAS: request from 194.47.243.119:35344 => bad resource ID
CAS: Request from 194.47.243.119:35344 => cmmd=1 cid=0x2a29 type=34 count=1 postsize=16
CAS: Request from 194.47.243.119:35344 =>   available=0x4966    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35344
CAS: forcing disconnect from 194.47.243.119:35346
CAS: forcing disconnect from 194.47.243.119:35348
CAS: forcing disconnect from 194.47.243.119:35350
CAS: forcing disconnect from 194.47.243.119:35352
CAS: request from 194.47.243.119:35354 => bad resource ID
CAS: Request from 194.47.243.119:35354 => cmmd=1 cid=0x3268 type=17 count=1 postsize=16
CAS: Request from 194.47.243.119:35354 =>   available=0x48e1    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35354
CAS: forcing disconnect from 194.47.243.119:35356
CAS: forcing disconnect from 194.47.243.119:35358
CAS: forcing disconnect from 194.47.243.119:35360
CAS: request from 194.47.243.119:35362 => bad resource ID
CAS: Request from 194.47.243.119:35362 => cmmd=1 cid=0x2a2d type=34 count=1 postsize=16
CAS: Request from 194.47.243.119:35362 =>   available=0x3775    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35362
CAS: forcing disconnect from 194.47.243.119:35364
CAS: forcing disconnect from 194.47.243.119:35366
CAS: forcing disconnect from 194.47.243.119:35368
CAS: forcing disconnect from 194.47.243.119:35370
CAS: request from 194.47.243.119:35372 => bad resource ID
CAS: Request from 194.47.243.119:35372 => cmmd=1 cid=0x3a96 type=14 count=1 postsize=16
CAS: Request from 194.47.243.119:35372 =>   available=0x3c8a    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35372
CAS: forcing disconnect from 194.47.243.119:35374
CAS: request from 194.47.243.119:35376 => bad resource ID
CAS: Request from 194.47.243.119:35376 => cmmd=1 cid=0x3c04 type=20 count=1 postsize=16
CAS: Request from 194.47.243.119:35376 =>   available=0x50fb    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35376
CAS: forcing disconnect from 194.47.243.119:35378
CAS: forcing disconnect from 194.47.243.119:35382
CAS: forcing disconnect from 194.47.243.119:35384
CAS: forcing disconnect from 194.47.243.119:35386
CAS: forcing disconnect from 194.47.243.119:35388
CAS: forcing disconnect from 194.47.243.119:35390
CAS: request from 194.47.243.119:35392 => bad resource ID
CAS: Request from 194.47.243.119:35392 => cmmd=1 cid=0x41e5 type=17 count=1 postsize=16
CAS: Request from 194.47.243.119:35392 =>   available=0x4d0e    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35392
CAS: forcing disconnect from 194.47.243.119:35394
CAS: request from 194.47.243.119:35396 => bad resource ID
CAS: Request from 194.47.243.119:35396 => cmmd=1 cid=0x40ed type=34 count=1 postsize=16
CAS: Request from 194.47.243.119:35396 =>   available=0x59a9    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35396
CAS: request from 194.47.243.119:35398 => bad resource ID
CAS: Request from 194.47.243.119:35398 => cmmd=1 cid=0x421e type=19 count=1 postsize=16
CAS: Request from 194.47.243.119:35398 =>   available=0x58ef    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35398
CAS: forcing disconnect from 194.47.243.119:35400
CAS: request from 194.47.243.119:35402 => bad resource ID
CAS: Request from 194.47.243.119:35402 => cmmd=1 cid=0x43a4 type=24 count=1 postsize=16
CAS: Request from 194.47.243.119:35402 =>   available=0x759     N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35402
CAS: forcing disconnect from 194.47.243.119:35404
CAS: forcing disconnect from 194.47.243.119:35406
CAS: forcing disconnect from 194.47.243.119:35408
CAS: forcing disconnect from 194.47.243.119:35410
CAS: forcing disconnect from 194.47.243.119:35412
CAS: forcing disconnect from 194.47.243.119:35414
CAS: forcing disconnect from 194.47.243.119:35416
CAS: request from 194.47.243.119:35418 => bad resource ID
CAS: Request from 194.47.243.119:35418 => cmmd=1 cid=0x4316 type=34 count=1 postsize=16
CAS: Request from 194.47.243.119:35418 =>   available=0x5504    N=0 paddr=(nil)
CAS: forcing disconnect from 194.47.243.119:35418
CAS: forcing disconnect from 194.47.243.119:35420

Restarting IOC again does not solve the problem. Only a restart of CSS solved the problem.

The OPI comes from here https://github.com/areaDetector/ADCore/blob/master/ADApp/op/opi/autoconvert/NDFileHDF5.opi.

I (CR: the user) do not think the problem is specific to that OPI since I've noticed this behavior many times before, in different situations... I've just been lazy not to report the issue.

tashasummers commented 6 years ago

I have also had this issue several times with v4.4 on debian linux, and see the same errors as Claudio posted. It generally occurs after several IOC reboots. I understand the error to mean someone is trying to connect to PVs with a stale ID number. In our case, the IP of the misbehaving client was one of the control room consoles and a reboot of CS-S running on this machine cleared the errors.

Maybe related? I have noticed several machines never reconnect to PVs after an IOC reboot and the only solution is to restart CS-S.

claudio-rosati commented 6 years ago

@berryma4 @kasemir @shroffk Have you some ideas?

kasemir commented 6 years ago

All this tells me is that there are some problems in the CAJ library that we don't understand. I've not experienced this specific one, but I've seen CA gateways complain about

zero length PV name in UDP search request

and right after that the gateway may crash. It's impossible to create a CAJ channel with empty name, you get an exception right away. This must thus be a bug inside CAJ where it for some reason looses the channel name and sends requests without a name.

Kunal started to take over maintenance of CAJ, but I don't think anybody here fully understands the internals. So what we'd need first is reproducible examples.

As long as we cannot reproduce the scenario, one thing that might help, once you somehow do end up in the error situation, is to simply gather more information: Try the same channel with Probe and the PV Tree. Probe uses the PVManager, PV Tree uses vtype.pv, so they use different CAJ contexts. Does one of them still "work"? Then reduce that CSS instance to just one channel, i.e. close all displays and open just one Probe or PV Tree, whichever exhibits the problem. Enable the full log level for JCA/CAJ, then try to connect to a channel. Maybe those log messages will help us understand what's going on.