Closed sasaki77 closed 3 years ago
The alarm server stops and restarts the alarm PV when it receives the config delete message. Then it stops the alarm PV again when it receives the tombstone marker.
The delete message is meant to be informational, allowing a logger to track which user, on which host, deleted a channel. The actual deletion is the null tombstone.
If the alarm server reacts to both, that would be a bug. The alarm server needs to react to config changes (stop the PV, apply the config changes, then restart the PV), but maybe it treats the 'delete' message as a config change. It should instead ignore the delete message and only delete the PV when the tombstone is received.
Updated alarm server so it will ignore the informational config "delete" message, avoiding that extraneous stop/restart.
It will only react to the final config null message, deleting the PV once.
That should avoid the issue you see, just as removing the "delete" message did, while keeping the "delete" message for informational purposes (to log who triggered the deletion).
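The handling described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Phoebus alarm server code (which is Java). The class name `AlarmServerSketch`, the JSON field names (`delete`, `user`, `host`, `description`) and the path `/Demo/Alarm00000` are assumptions made for the example.

```python
import json
from typing import Optional


class AlarmServerSketch:
    """Hypothetical stand-in for the alarm server's config handling."""

    def __init__(self) -> None:
        self.active = {}   # path -> current config of each running alarm PV
        self.starts = 0    # number of PV start operations
        self.stops = 0     # number of PV stop operations

    def handle_config(self, path: str, value: Optional[str]) -> None:
        if value is None:
            # Tombstone (null value): this is the actual deletion.
            if self.active.pop(path, None) is not None:
                self.stops += 1
            return

        config = json.loads(value)
        if "delete" in config:
            # Informational message (who deleted, on which host).
            # Ignored here; a logger may still record it.
            return

        # Genuine config change: stop the PV if it runs, apply, restart.
        if path in self.active:
            self.stops += 1
        self.active[path] = config
        self.starts += 1


server = AlarmServerSketch()
server.handle_config("/Demo/Alarm00000", '{"description": "demo"}')
server.handle_config(
    "/Demo/Alarm00000",
    '{"user": "someone", "host": "somehost", "delete": "Deleting"}',
)
server.handle_config("/Demo/Alarm00000", None)
```

With this handling, the informational message causes no stop/restart, and the PV is stopped exactly once when the tombstone arrives.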
It doesn't fully explain what's happening at the Channel Access level. It is unclear what causes the "bad resource ID", which then results in the IOC closing the network connection, leaving all channels from the IOC disconnected until the client tries to reconnect. That would best be explored with a simpler test that just tries to stop/start many PVs without all the remaining alarm logic. Can you provide the script you use to create those many Alarm00000 records?
I confirmed that the alarm server can now delete alarm PVs reliably.
The script is available from the following link. The db file for 50000 alarm PVs is in the demo directory.
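For readers without access to the link, a db file of that shape could be generated along the following lines. This is only a sketch and not the linked script: the function name `make_alarm_db`, the `ai` record type and the `HIHI`/`HHSV` fields are assumptions; only the `Alarm00000`-style naming and the count of 50000 come from this thread.

```python
def make_alarm_db(count: int, prefix: str = "Alarm", width: int = 5) -> str:
    """Return EPICS database text with `count` records named Alarm00000, Alarm00001, ..."""
    records = []
    for i in range(count):
        name = f"{prefix}{i:0{width}d}"  # zero-padded index, e.g. Alarm00042
        records.append(
            f'record(ai, "{name}") {{\n'
            '    field(HIHI, "10")\n'      # assumed alarm limit
            '    field(HHSV, "MAJOR")\n'   # assumed alarm severity
            "}\n"
        )
    return "".join(records)
```

Calling `make_alarm_db(50000)` and writing the result to a file would give a database that an IOC can load for this kind of stress test.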
Thanks!
I'm closing this issue for now. The connection problem still remains, but it might be a problem in the connection layer rather than in the alarm server.
I agree that the alarm server is now avoiding the issue, but it might still exist. I plan to look at that within the upcoming weeks with test code that connects/disconnects in a manner similar to what the alarm server did before we improved the 'delete' handling.
So far I haven't been able to reproduce the "bad resource ID" on the IOC which caused it to close the connection, leading to the follow-up problems. What version of EPICS base are you using to run the IOC? I'm currently testing with EPICS R7.0.4.1.
I have tested with EPICS R3.15.7. The test environment is as follows.
Bug description
The CA connection between the alarm server and the IOC becomes unstable when alarm PVs are deleted. The connection repeatedly disconnects and reconnects.
To Reproduce
Workaround
Comment out these lines so that the config delete message is not sent.
https://github.com/ControlSystemStudio/phoebus/blob/3d46002b4c58d06aeec97d0eecdf84d5e7332cf3/app/alarm/model/src/main/java/org/phoebus/applications/alarm/client/AlarmClient.java#L542-L544
The alarm server stops and restarts the alarm PV when it receives the config delete message. Then it stops the alarm PV again when it receives the tombstone marker. The following is an example log.
If the config delete message is not sent, the alarm server just stops the alarm PVs. I'm not sure why, but in this situation we can delete alarm PVs without problems.
Additional context
Logs when PVs are deleted
Alarm server log
IOC A Log