Closed by ralphlange 2 years ago
BOY seems to be able to connect to the same channels fine, on the other hand.
The logs show unusual things, as if the IOCs were constantly reconnecting:
2022-09-13T08:25:26.083Z INFO [Thread 53] com.cosylab.epics.caj.impl.CABeaconHandler (updateBeaconPeriod) - New server beacon /10.130.1.108:35369
These are printouts added in #53. In principle, they simply show what was always happening as the CA client checks received beacons for anomalies.
At the SNS, we experienced problems with CA clients being too sensitive to changes in beacon timing, resulting in bursts of search traffic up to the point where clients would continually search. In #53, we added log messages to show how the client reacts, and related settings that used to be fixed became configurable, but with default values matching the original settings.
So while you can now configure the client to search less often, the fact that you see the log messages should mean that your clients are still searching. Still, you could run caSnooper or Wireshark to check which of two cases applies: either the clients are still searching and these are just new informational messages, or there is indeed an unintended side effect that stops the clients from searching, so they never get a search reply and never connect.
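As a complement to caSnooper or Wireshark, the beacon messages already present in the log can be tallied per server address; a server that is reported as "new" over and over would support the impression of constant reconnects. A minimal sketch in Python, assuming the log lines look like the CABeaconHandler line quoted above. The function name and the regular expression are illustrative and not part of JCA:

```python
import re
from collections import Counter

# Matches the address part of a CABeaconHandler message such as
# "... - New server beacon /10.130.1.108:35369" (format assumed from the log).
BEACON_RE = re.compile(r"New server beacon /(\d+(?:\.\d+){3}):(\d+)")


def count_new_beacons(lines):
    """Return a Counter mapping server IP -> number of 'New server beacon' messages."""
    counts = Counter()
    for line in lines:
        match = BEACON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2022-09-13T08:25:26.083Z INFO [Thread 53] "
        "com.cosylab.epics.caj.impl.CABeaconHandler (updateBeaconPeriod) "
        "- New server beacon /10.130.1.108:35369",
    ]
    # Print the servers with the most "new beacon" messages first.
    for address, n in count_new_beacons(sample).most_common():
        print(address, n)
```

Comparing the tallies from a 2.4.6 run and a 2.4.7 run would also show whether the set of servers seen by the client differs between the two.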
I think the beacon-related messages and configuration options are already in 2.4.6. The main changes in 2.4.7 seem to be build settings, a detail of UDP vs. TCP port configurability (https://github.com/epics-base/jca/commit/288de285942811d6dc37b2366503def8deb448ed), and an echo timeout setting (https://github.com/epics-base/jca/commit/7173c03c5620e842849fedbaced2402e4c49e4b1).
We're getting a bit closer here...
The test setup that the logs are from is using an arbitrary port for CA, which is configured in the IOCs and the tools (DIIRT for BOY, environment for the others) when the test is run.
The bad case logs show that the tools are connecting to every available IOC on the network on the standard port, but not to the IOCs they should connect to.
It seems that setting the port (for the client) through EPICS_CA_SERVER_PORT is broken.
Possibly a side effect of merging slominskir/patch-7, i.e. 288de28? We're trying to verify that by reverting that commit from 2.4.7.
It seems that setting the port (for the client) through EPICS_CA_SERVER_PORT is broken.
That would be a nice explanation and somewhat easy to fix.
Nadine is testing a 2.4.7 with Ryan's merge commit reverted.
Reverting that merge commit doesn't change anything. Duh.
Turns out we're not using env vars for the tools. Both CS-Studio/BOY and the archive/alarm servers are configured through DIIRT properties, which are set to randomly chosen port numbers, for example:
org.csstudio.diirt.util.core.preferences/diirt.ca.repeater.port=25146
org.csstudio.diirt.util.core.preferences/diirt.ca.server.port=32758
The test IOCs are getting configured accordingly using env vars. They are reporting the right settings (checked from the iocShell).
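A quick cross-check of both sides can rule out a simple mismatch. The sketch below, in Python, parses preference lines in the `plugin/key=value` form shown above and compares them with the environment an IOC was started in. It assumes the IOC side uses EPICS_CA_SERVER_PORT and EPICS_CA_REPEATER_PORT; the helper names and the key-to-variable pairing are made up for illustration:

```python
import os

# Assumed pairing of DIIRT preference keys with EPICS environment variables.
KEY_TO_ENV = {
    "diirt.ca.server.port": "EPICS_CA_SERVER_PORT",
    "diirt.ca.repeater.port": "EPICS_CA_REPEATER_PORT",
}


def parse_preferences(text):
    """Parse 'plugin.id/key=value' lines into {key: value}, dropping the plugin id."""
    prefs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        qualified_key, value = line.split("=", 1)
        key = qualified_key.rsplit("/", 1)[-1]
        prefs[key.strip()] = value.strip()
    return prefs


def find_mismatches(prefs, env=None):
    """Return {key: (preference_value, env_value)} for every pair that differs."""
    env = os.environ if env is None else env
    mismatches = {}
    for key, env_name in KEY_TO_ENV.items():
        pref_value, env_value = prefs.get(key), env.get(env_name)
        if pref_value != env_value:
            mismatches[key] = (pref_value, env_value)
    return mismatches
```

An empty result from `find_mismatches` means the tool preferences and the IOC environment agree on both ports, at least on paper; it says nothing about whether the running process actually picked the preferences up.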
Well, the changes were meant primarily for cas, but the BroadcastTransport implementation is used by both the CAJServerContext and the CAJContext, so it would have been great if that was the cause.
For some reason, updating the jca jar from 2.4.6 to 2.4.7 led to the DIIRT properties/preferences being ignored in case of BEAST/BEAUTY, while the same properties/preferences are still working for BOY.
The Eclipse-based code juggles several settings which are copied between Eclipse preferences, DIIRT configuration files, Java properties and environment settings, one of them being https://github.com/ControlSystemStudio/cs-studio/tree/master/core/diirt/diirt-plugins/org.csstudio.diirt.util.core.preferences. Not sure how changing the JCA jar would influence that, but having, say, the diirt.util.core.preferences plugin in the GUI but not in the headless services would explain their different behavior.
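With that many places a port can come from, it may help to log which source actually supplied the value at startup. A hypothetical sketch in Python: the order of precedence, the source labels and the function name are assumptions for illustration and do not describe what CS-Studio or JCA actually do. Only the standard CA server port, 5064, is a given:

```python
DEFAULT_CA_SERVER_PORT = 5064  # standard Channel Access server port


def resolve_port(sources, default=DEFAULT_CA_SERVER_PORT):
    """Return (port, source_name) from the first source holding a valid port.

    `sources` is an ordered list of (name, value) pairs, highest priority
    first; a value of None or one that is not a valid port is skipped.
    """
    for name, value in sources:
        if value is None:
            continue
        try:
            port = int(str(value).strip())
        except ValueError:
            continue
        if 0 < port < 65536:
            return port, name
    return default, "default"
```

A headless service that reports `"default"` while the GUI reports a preference source would point at the configuration path rather than at the JCA jar.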
Thanks a lot for all your help!!!
It was a false alarm in the end. (Of course...) Buried in the complexity of our systemd startup and control mechanism is a script that gets used for the background daemons (not for CS-Studio/BOY) by the 'regular' start/stop mechanism and by the test procedures. A change in the systemd scripts - done by a different colleague on the same day - dropped the ability to set specific .ini files on the command line, as that feature was not used for regular operation. It was not obvious that the tests would use that and fail to set different ports for the archive engine and the alarm server.
Apologies for the noise.
I love self-solving issues! Thanks for the update
Going from JCA 2.4.6 to JCA 2.4.7, we see strange behavior: when used within BEAST or BEAUTY, the tools will not connect to channels. The logs show unusual things, as if the IOCs were constantly reconnecting.
Can you relate that to any recent change in JCA?
For additional verification, I will be creating a 2.4.6 jar with only my echoTimeout changes added.
Here are logs when using JCA 2.4.6 - the good case.
Archive engine:
Alarm server:
The same startup logs, using JCA 2.4.7 - the bad case.
Archive engine:
Alarm server: