ControlSystemStudio / phoebus

A framework and set of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
http://phoebus.org/
Eclipse Public License 1.0
90 stars 89 forks source link

Alarm: CA connection becomes unstable when PVs are deleted #1936

Closed sasaki77 closed 3 years ago

sasaki77 commented 3 years ago

Bug description CA connection between Alarm server and IOC becomes unstable when Alarm PVs are deleted. The CA connection becomes disconnected and connected repeatedly.

To Reproduce

  1. There are 2 IOCs. IOC A has 500 PVs, IOC B has 49500 PVs.
  2. Alarm server is configured to monitor 50000 PVs (500 + 49500) in 2 IOCs.
  3. Start Alarm server.
  4. Open Phoebus Alarm tree GUI. Delete 400 Alarm PVs running on IOC A.
  5. The CA connection becomes unstable. The CA connection between Alarm server and IOC A becomes disconnected and connected repeatedly.
  6. The CA connection between Alarm server and IOC B has no problem.

Workaround Comment out these lines not to send config delete message.

https://github.com/ControlSystemStudio/phoebus/blob/3d46002b4c58d06aeec97d0eecdf84d5e7332cf3/app/alarm/model/src/main/java/org/phoebus/applications/alarm/client/AlarmClient.java#L542-L544

Alarm server stops and starts alarm PV when it receives Config delete message. Then It stops alarm PV when it receives the tombstone marker. Following is an example log.

2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Stop Alarm00000
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Start Alarm00000
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Detach /Accelerator/TEST/Alarm00000 from parent
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Stop Alarm00000

We can just stop the alarm PVs if config delete messages is not sent. I'm not sure why, but we can delete Alarm PVs without problem in this situation.

Additional context

Logs when PVs are deleted Alarm server log

2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Stop Alarm00000
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Start Alarm00000
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Detach /Accelerator/TEST/Alarm00000 from parent
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Stop Alarm00000
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Stop Alarm00001
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Start Alarm00001
2021-06-21 11:13:37 FINE [org.phoebus.applications.alarm] Detach /Accelerator/TEST/Alarm00001 from parent
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00491 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00491 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.024, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.024
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00459 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00459 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.029, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.029
2021-06-21 11:13:38 WARNING [org.phoebus.pv] Channel Access Exception from com.cosylab.epics.caj.CAJContext@67f275e0 for channel CAJChannel = { name = Alarm00001, connectionState = CONNECTED }: User destroyed channel
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] Stop Alarm00001
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00435 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00435 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.063, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.063
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00491 changes to OK/NO_ALARM (0.0), 2021-06-21 11:13:37.926, OK/NO_ALARM (0.0), 2021-06-21 11:13:37.926
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00290 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00290 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.099, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.099
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00491 received VDouble[0.0, NONE(NONE) - NO_ALARM, 2021-06-21T02:13:37.926066615Z] -> Current: OK/NO_ALARM (0.0), 2021-06-21 11:13:37.926, Alarm: OK/NO_ALARM (0.0), 2021-06-21 11:13:37.926
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00386 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00386 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.129, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.129
2021-06-21 11:13:38 FINE [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00005 disconnected
2021-06-21 11:13:38 FINER [org.phoebus.applications.alarm] /Accelerator/TEST/Alarm00005 changes to UNDEFINED/Disconnected (), 2021-06-21 11:13:38.141, UNDEFINED/Disconnected (), 2021-06-21 11:13:38.141
2021-06-21 11:13:38 SEVERE [com.cosylab.epics.caj.CAJChannel]
java.lang.IllegalStateException: transport closed
        at com.cosylab.epics.caj.impl.CATransport.submit(CATransport.java:908)
        at com.cosylab.epics.caj.impl.requests.AbstractCARequest.submit(AbstractCARequest.java:88)
        at com.cosylab.epics.caj.CAJMonitor.clear(CAJMonitor.java:141)
        at com.cosylab.epics.caj.CAJChannel.destroyAllMonitors(CAJChannel.java:1213)
        at com.cosylab.epics.caj.CAJChannel.destroyChannel(CAJChannel.java:414)
        at com.cosylab.epics.caj.CAJContext.destroyChannel(CAJContext.java:1069)
        at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:383)
        at com.cosylab.epics.caj.CAJChannel.destroy(CAJChannel.java:368)
        at gov.aps.jca.Channel.dispose(Channel.java:710)
        at org.phoebus.pv.ca.JCA_PV.close(JCA_PV.java:613)
        at org.phoebus.pv.PVPool.releasePV(PVPool.java:181)
        at org.phoebus.applications.alarm.server.AlarmServerPV.stop(AlarmServerPV.java:375)
        at org.phoebus.applications.alarm.server.ServerModel.checkUpdates(ServerModel.java:188)
        at org.phoebus.applications.alarm.server.ServerModel.run(ServerModel.java:125)
        at java.base/java.lang.Thread.run(Thread.java:832)
...
2021-06-21 11:13:38 WARNING [org.phoebus.pv] Channel Access Exception from com.cosylab.epics.caj.CAJContext@67f275e0 for channel CAJChannel = { name = Alarm00206, connectionState = DISCONNECTED }: Virtual circuit disconnect
2021-06-21 11:13:38 SEVERE [com.cosylab.epics.caj.CAJChannel]
java.io.IOException: Broken pipe
        at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
        at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
        at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:484)
        at com.cosylab.epics.caj.impl.CATransport.noSyncSend(CATransport.java:685)
        at com.cosylab.epics.caj.impl.CATransport.submit(CATransport.java:886)
        at com.cosylab.epics.caj.impl.requests.EventAddRequest.resubscribeSubscription(EventAddRequest.java:170)
        at com.cosylab.epics.caj.CAJChannel.resubscribeSubscriptions(CAJChannel.java:1300)
        at com.cosylab.epics.caj.CAJChannel.connectionCompleted(CAJChannel.java:354)
        at com.cosylab.epics.caj.impl.handlers.CreateChannelResponse.internalHandleResponse(CreateChannelResponse.java:48)
        at com.cosylab.epics.caj.impl.handlers.AbstractCAResponseHandler.handleResponse(AbstractCAResponseHandler.java:110)
        at com.cosylab.epics.caj.impl.CAResponseHandler.handleResponse(CAResponseHandler.java:139)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:519)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:431)
        at com.cosylab.epics.caj.impl.CATransport.handleEvent(CATransport.java:369)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.handleEvent(LeaderFollowersHandler.java:77)
        at com.cosylab.epics.caj.impl.reactor.Reactor.processInternal(Reactor.java:404)
        at com.cosylab.epics.caj.impl.reactor.Reactor.process(Reactor.java:288)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.run(LeaderFollowersHandler.java:91)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)

2021-06-21 11:13:38 WARNING [org.phoebus.pv] Alarm00348 cannot get meta data
java.lang.IllegalStateException: No channel transport available, channel disconnected.
        at com.cosylab.epics.caj.CAJChannel.get(CAJChannel.java:989)
        at org.phoebus.pv.ca.JCA_PV.getMetaData(JCA_PV.java:231)
        at org.phoebus.pv.ca.JCA_PV.connectionChanged(JCA_PV.java:207)
        at gov.aps.jca.event.DirectEventDispatcher.dispatch(DirectEventDispatcher.java:56)
        at com.cosylab.epics.caj.CAJChannel.setConnectionState(CAJChannel.java:654)
        at com.cosylab.epics.caj.CAJChannel.connectionCompleted(CAJChannel.java:355)
        at com.cosylab.epics.caj.impl.handlers.CreateChannelResponse.internalHandleResponse(CreateChannelResponse.java:48)
        at com.cosylab.epics.caj.impl.handlers.AbstractCAResponseHandler.handleResponse(AbstractCAResponseHandler.java:110)
        at com.cosylab.epics.caj.impl.CAResponseHandler.handleResponse(CAResponseHandler.java:139)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:519)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:431)
        at com.cosylab.epics.caj.impl.CATransport.handleEvent(CATransport.java:369)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.handleEvent(LeaderFollowersHandler.java:77)
        at com.cosylab.epics.caj.impl.reactor.Reactor.processInternal(Reactor.java:404)
        at com.cosylab.epics.caj.impl.reactor.Reactor.process(Reactor.java:288)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.run(LeaderFollowersHandler.java:91)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)

2021-06-21 11:13:38 SEVERE [com.cosylab.epics.caj.impl.CATransport]
java.lang.NullPointerException: Cannot invoke "com.cosylab.epics.caj.impl.CATransport.getMinorRevision()" because "this.transport" is null
        at com.cosylab.epics.caj.CAJChannel.connectionCompleted(CAJChannel.java:338)
        at com.cosylab.epics.caj.impl.handlers.CreateChannelResponse.internalHandleResponse(CreateChannelResponse.java:48)
        at com.cosylab.epics.caj.impl.handlers.AbstractCAResponseHandler.handleResponse(AbstractCAResponseHandler.java:110)
        at com.cosylab.epics.caj.impl.CAResponseHandler.handleResponse(CAResponseHandler.java:139)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:519)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:431)
        at com.cosylab.epics.caj.impl.CATransport.handleEvent(CATransport.java:369)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.handleEvent(LeaderFollowersHandler.java:77)
        at com.cosylab.epics.caj.impl.reactor.Reactor.processInternal(Reactor.java:404)
        at com.cosylab.epics.caj.impl.reactor.Reactor.process(Reactor.java:288)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.run(LeaderFollowersHandler.java:91)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)

2021-06-21 11:13:38 SEVERE [com.cosylab.epics.caj.impl.CATransport]
java.lang.NullPointerException: Cannot invoke "com.cosylab.epics.caj.impl.CATransport.getMinorRevision()" because "this.transport" is null
        at com.cosylab.epics.caj.CAJChannel.connectionCompleted(CAJChannel.java:338)
        at com.cosylab.epics.caj.impl.handlers.CreateChannelResponse.internalHandleResponse(CreateChannelResponse.java:48)
        at com.cosylab.epics.caj.impl.handlers.AbstractCAResponseHandler.handleResponse(AbstractCAResponseHandler.java:110)
        at com.cosylab.epics.caj.impl.CAResponseHandler.handleResponse(CAResponseHandler.java:139)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:519)
        at com.cosylab.epics.caj.impl.CATransport.processRead(CATransport.java:431)
        at com.cosylab.epics.caj.impl.CATransport.handleEvent(CATransport.java:369)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.handleEvent(LeaderFollowersHandler.java:77)
        at com.cosylab.epics.caj.impl.reactor.Reactor.processInternal(Reactor.java:404)
        at com.cosylab.epics.caj.impl.reactor.Reactor.process(Reactor.java:288)
        at com.cosylab.epics.caj.impl.reactor.lf.LeaderFollowersHandler.run(LeaderFollowersHandler.java:91)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)
...

IOC A Log

CAS: forcing disconnect from 127.0.0.1:38454
CAS: request from 127.0.0.1:38492 => bad resource ID
CAS: Request from 127.0.0.1:38492 => cmmd=2 cid=0x456 type=20 count=1 postsize=0
CAS: Request from 127.0.0.1:38492 =>   available=0x440  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38492
CAS: request from 127.0.0.1:38494 => bad resource ID
CAS: Request from 127.0.0.1:38494 => cmmd=2 cid=0x469 type=20 count=1 postsize=0
CAS: Request from 127.0.0.1:38494 =>   available=0x453  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38494
CAS: request from 127.0.0.1:38496 => bad resource ID
CAS: Request from 127.0.0.1:38496 => cmmd=2 cid=0x677 type=20 count=1 postsize=0
CAS: Request from 127.0.0.1:38496 =>   available=0x466  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38496
CAS: request from 127.0.0.1:38498 => bad resource ID
CAS: Request from 127.0.0.1:38498 => cmmd=1 cid=0x789 type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38498 =>   available=0x6e1  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38498
CAS: request from 127.0.0.1:38500 => bad resource ID
CAS: Request from 127.0.0.1:38500 => cmmd=1 cid=0x7b1 type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38500 =>   available=0x704  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38500
CAS: request from 127.0.0.1:38502 => bad resource ID
CAS: Request from 127.0.0.1:38502 => cmmd=1 cid=0x7c0 type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38502 =>   available=0x6c4  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38502
CAS: request from 127.0.0.1:38504 => bad resource ID
CAS: Request from 127.0.0.1:38504 => cmmd=1 cid=0x949 type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38504 =>   available=0x64a  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38504
CAS: request from 127.0.0.1:38506 => bad resource ID
CAS: Request from 127.0.0.1:38506 => cmmd=1 cid=0x95e type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38506 =>   available=0x7ee  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38506
CAS: request from 127.0.0.1:38508 => bad resource ID
CAS: Request from 127.0.0.1:38508 => cmmd=1 cid=0xb68 type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38508 =>   available=0x686  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38508
CAS: request from 127.0.0.1:38510 => bad resource ID
CAS: Request from 127.0.0.1:38510 => cmmd=15 cid=0xb7d type=34 count=1 postsize=0
CAS: Request from 127.0.0.1:38510 =>   available=0x18c7e    N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38510
CAS: request from 127.0.0.1:38512 => bad resource ID
CAS: Request from 127.0.0.1:38512 => cmmd=2 cid=0x569 type=20 count=1 postsize=0
CAS: Request from 127.0.0.1:38512 =>   available=0x6a7  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38512
CAS: request from 127.0.0.1:38514 => bad resource ID
CAS: Request from 127.0.0.1:38514 => cmmd=15 cid=0xce0 type=34 count=1 postsize=0
CAS: Request from 127.0.0.1:38514 =>   available=0x18ca9    N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38514
CAS: forcing disconnect from 127.0.0.1:38516
CAS: request from 127.0.0.1:38518 => bad resource ID
CAS: Request from 127.0.0.1:38518 => cmmd=2 cid=0xe4a type=20 count=1 postsize=0
CAS: Request from 127.0.0.1:38518 =>   available=0x6b9  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38518
CAS: request from 127.0.0.1:38520 => bad resource ID
CAS: Request from 127.0.0.1:38520 => cmmd=1 cid=0xf2b type=20 count=1 postsize=16
CAS: Request from 127.0.0.1:38520 =>   available=0x818  N=0 paddr=(nil)
CAS: forcing disconnect from 127.0.0.1:38520
CAS: forcing disconnect from 127.0.0.1:38522
...
kasemir commented 3 years ago

Alarm server stops and starts alarm PV when it receives Config delete message. Then It stops alarm PV when it receives the tombstone marker.

The delete message is meant to be informational, allowing a logger to track what user, on what host, deleted a channel. The actual deletion is the null tombstone.

If the alarm server reacts to both, that would be a bug. The alarm server needs to react to config changes (stop PV, apply config changes, then restart the PV), but maybe it's considering the 'delete' message a config change when it should better ignore the delete message, and then just delete the PV when the tombstone is received.

kasemir commented 3 years ago

Updated alarm server so it will ignore the informational config "delete" message, avoiding that extraneous stop/restart. It will only react to the final config null message, deleting the PV once. That should avoid the issue you see, just as removing the "delete" message did, yet keeping the "delete" message for informational purpose (to log who triggered the deletion).

It doesn't fully explain what's happening on the channel access level. Unclear what causes the "bad resource ID" which then results in the IOC closing the network connection, leaving all channels from the IOC being disconnected until the client tries to re-connect. That would best be explored with a simpler test that just tries to stop/start many PVs without all the remaining alarm logic. Can you provide the script you use to create those many Alarm00000 records?

sasaki77 commented 3 years ago

I confirmed that the alarm server can delete alarm PVs stably.

The script is available from the following link. The db file for 50000 alarm PVs is in demo directory.

https://github.com/sasaki77/alarm-test-script

kasemir commented 3 years ago

Thanks!

sasaki77 commented 3 years ago

I'm closing this issue for now. Connection problem still remain, but that might be a connection layer problem not the alarm server problem.

kasemir commented 3 years ago

I agree that the alarm server is now avoiding the issue, but it might still exist. I plan to look at that within the upcoming weeks with test code that connects/disconnects in a manner similar to what the alarm server did before we improved the 'delete' handling.

kasemir commented 3 years ago

So far haven't been able to reproduce the "bad resource ID" on the IOC which caused it to close the connection, leading to the follow-up problems. What version of EPICS base are you using to run the IOC? I'm currently testing with EPICS R7.0.4.1

sasaki77 commented 3 years ago

I have tested with EPICS R3.15.7. The test environment is as followings.