Open claudio-rosati opened 6 years ago
I have also had this issue several times with v4.4 on Debian Linux, and I see the same errors that Claudio posted. It generally occurs after several IOC reboots. I understand the error to mean that a client is trying to connect to PVs with a stale ID number. In our case, the IP of the misbehaving client belonged to one of the control room consoles, and restarting the CS-S instance running on that machine cleared the errors.
Maybe related? I have noticed several machines never reconnect to PVs after an IOC reboot and the only solution is to restart CS-S.
@berryma4 @kasemir @shroffk Do you have any ideas?
All this tells me is that there are some problems in the CAJ library that we don't understand. I've not experienced this specific one, but I've seen CA gateways complain about
zero length PV name in UDP search request
and right after that the gateway may crash. It's impossible to create a CAJ channel with an empty name; you get an exception right away. So this must be a bug inside CAJ where, for some reason, it loses the channel name and sends requests without a name.
Kunal started to take over maintenance of CAJ, but I don't think anybody here fully understands the internals. So what we'd need first is reproducible examples.
As long as we cannot reproduce the scenario, one thing that might help, once you somehow do end up in the error situation, is to simply gather more information: Try the same channel with Probe and the PV Tree. Probe uses the PVManager, PV Tree uses vtype.pv, so they use different CAJ contexts. Does one of them still "work"? Then reduce that CSS instance to just one channel, i.e. close all displays and open just one Probe or PV Tree, whichever exhibits the problem. Enable the full log level for JCA/CAJ, then try to connect to a channel. Maybe those log messages will help us understand what's going on.
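For anyone who wants to try the "enable the full log level" step outside of the CSS logging preferences, here is a minimal sketch using plain `java.util.logging`. The logger names `com.cosylab.epics.caj` and `gov.aps.jca` are assumptions based on the CAJ/JCA package names, and the `CajLogging` class and `enableFullLogging` helper are made up for illustration; check which logger names your CAJ version actually uses before relying on this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CajLogging {
    // LogManager only holds loggers weakly, so keep strong references;
    // otherwise a configured logger may be garbage collected and lose its level.
    private static final List<Logger> KEEP = new ArrayList<>();

    /** Set the given loggers (and, by inheritance, their children) to Level.ALL. */
    public static List<Logger> enableFullLogging(String... loggerNames) {
        // The default ConsoleHandler level is INFO, so raise it as well.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);

        List<Logger> configured = new ArrayList<>();
        for (String name : loggerNames) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.ALL);
            logger.addHandler(handler);
            // Avoid duplicate output through the root logger's handlers.
            logger.setUseParentHandlers(false);
            configured.add(logger);
        }
        KEEP.addAll(configured);
        return configured;
    }

    public static void main(String[] args) {
        // Assumed logger names, derived from the CAJ and JCA package names.
        enableFullLogging("com.cosylab.epics.caj", "gov.aps.jca");

        // A child logger inherits Level.ALL from its configured parent.
        Logger.getLogger("com.cosylab.epics.caj.demo")
              .finest("FINEST messages are now visible");
    }
}
```

Inside CS-S itself the log levels are normally set through the product's logging configuration, so this only illustrates what "full log level" means at the `java.util.logging` layer.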
One of my users has the following problem.
After restarting the IOC, my 4.5.1.0 (CR: the one built from the current master branch) Linux CSS was unable to reconnect.
These messages appear in CSS Console:
On IOC side these messages appear:
Restarting the IOC again did not solve the problem. Only restarting CSS did.
The OPI comes from here https://github.com/areaDetector/ADCore/blob/master/ADApp/op/opi/autoconvert/NDFileHDF5.opi.
I (CR: the user) do not think the problem is specific to that OPI, since I've noticed this behavior many times before in different situations. I've just been too lazy to report the issue.