ControlSystemStudio / phoebus

A framework and set of tools to monitor and operate large scale control systems, such as the ones in the accelerator community.
http://phoebus.org/
Eclipse Public License 1.0

Phoebus on Windows shows disconnected PVs and hangs for certain screens when JCA library hangs in connecting to IOC #1958

Closed: tynanford closed this issue 1 year ago

tynanford commented 3 years ago

phoebus-error.txt

We have Phoebus installed on both Windows 10 and Linux. I am noticing an issue specific to our Windows machines.

I attached the entire error log; the problem seems to happen when opening a screen with many PVs. About 20% of the PVs show "INVALID - No value" and a few show "UNDEFINED - UDF". I can open the exact same screen on Linux and all of these PVs are connected with no errors.

Things I have tried:

All of them resulted in the same behavior and the same-looking errors. PVs show as disconnected even though they aren't, and trying to close the problematic screen causes Phoebus to hang (for the attached error log, I waited until Phoebus eventually became responsive again).

It is almost always the same PVs that show as disconnected. I can probe any specific one before opening the screen with no problem, but once the screen is opened, probe no longer works and PV List shows many PVs disconnected. Any ideas?

tynanford commented 3 years ago

Actually, it is not specific to Windows. There is a Linux machine also having similar issues.

kasemir commented 3 years ago

Do you see any errors on the IOC side?

In the log, two things stand out.

First, there's the UI freeze when you close a display and that display tries to close/disconnect its PVs. The thread that's trying to close a PV is blocked inside the CAJ library:

"JavaFX Application Thread" prio=5 Id=21 BLOCKED on com.cosylab.epics.caj.CAJChannel@745b694 owned by "pool-6-thread-4" Id=53
        at app//com.cosylab.epics.caj.CAJChannel.destroyChannel(CAJChannel.java:403)
        -  blocked on com.cosylab.epics.caj.CAJChannel@745b694
...
        at app//org.phoebus.pv.PVPool.releasePV(PVPool.java:181)
        at app//org.csstudio.display.builder.runtime.pv.RuntimePV.close(RuntimePV.java:167)
        ...

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@11a62eb6

Meanwhile, another CAJ thread is holding similar locks while trying to connect:

"pool-6-thread-4" prio=5 Id=53 RUNNABLE (in native)
        at java.base@16.0.1/sun.nio.ch.Net.connect0(Native Method)
...
        at java.base@16.0.1/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:851)
        at java.base@16.0.1/java.nio.channels.SocketChannel.open(SocketChannel.java:285)
        at app//com.cosylab.epics.caj.impl.CAConnector.tryConnect(CAConnector.java:179)
        at app//com.cosylab.epics.caj.impl.CAConnector.connect(CAConnector.java:94)
        at app//com.cosylab.epics.caj.CAJContext.getTransport(CAJContext.java:1611)
        ...
        Number of locked synchronizers = 4
        - java.util.concurrent.ThreadPoolExecutor$Worker@67e9560d
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@4362a514
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@21b97f94
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@6f48e513

At first glance I don't see the complete lock loop, but I suspect that's basically it: Trying to close a PV while CAJ is stuck trying to connect a socket.
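
As a toy illustration of that suspicion (plain Java, not CAJ's actual code; the class and addresses below are made up for the example): one thread holds a channel's monitor while a blocking socket connect waits out its timeout, so a second thread trying to destroy the same channel sits blocked until the connect gives up, which matches the two threads in the dump.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Standalone toy illustrating the suspected pattern, not CAJ code: a "connector"
// thread holds the channel's monitor while a blocking TCP connect waits for its
// timeout, so the "UI" thread calling destroy() on the same channel is stuck
// waiting for that monitor, just like the JavaFX thread in the dump above.
public class ChannelCloseBlockedDemo
{
    static class Channel
    {
        // Both connect and destroy synchronize on the channel instance
        synchronized void connect(String host, int port, int timeoutMs)
        {
            try (Socket socket = new Socket())
            {
                // Blocking connect; with an unreachable address this only
                // returns after timeoutMs has expired
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
            }
            catch (IOException ex)
            {
                System.out.println("connector: connect failed: " + ex);
            }
        }

        synchronized void destroy()
        {
            System.out.println("ui: channel destroyed");
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        final Channel channel = new Channel();

        // 192.0.2.1 (TEST-NET) is normally unroutable, forcing a connect timeout
        final Thread connector = new Thread(
            () -> channel.connect("192.0.2.1", 5064, 10_000), "connector");
        connector.start();

        Thread.sleep(500);   // let the connector grab the channel lock first

        final long start = System.currentTimeMillis();
        System.out.println("ui: closing channel...");
        channel.destroy();   // blocks until the connect attempt gives up
        System.out.println("ui: close took "
                           + (System.currentTimeMillis() - start) + " ms");
    }
}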

Which brings up the second issue: CAJ cannot connect to a socket:

com.cosylab.epics.caj.impl.ConnectionException: Failed to connect to '/131.243.89.132:5064'.
        at com.cosylab.epics.caj.impl.CAConnector.connect(CAConnector.java:138)
        at com.cosylab.epics.caj.CAJContext.getTransport(CAJContext.java:1611)
        at com.cosylab.epics.caj.CAJContext.searchResponse(CAJContext.java:1583)
        at com.cosylab.epics.caj.impl.handlers.SearchResponse.internalHandleResponse(SearchResponse.java:108)
...
Caused by: java.net.ConnectException: Connection timed out: connect
        at java.base/sun.nio.ch.Net.connect0(Native Method)

This is actually what triggers the first issue, the lockup, and it appears to stem from a problem on the IOC side. We search for a PV ("Who has PV XYZ?"), we get a response from the IOC ("I have it, connect to me at 131.243.89.132:5064"), we try to connect, and that connection times out.

Do you have a firewall which for some reason blocks 131.243.89.132:5064? Is there an error message on that IOC which would explain it?
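
One quick way to narrow that down, independent of Phoebus/CAJ, is a bare TCP connect to that address with a short timeout. A throwaway sketch (host and port default to the values from the log above; adjust as needed):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Simple diagnostic, independent of Phoebus/CAJ: try a plain TCP connect to the
// CA server port that the search reply pointed at.
public class CaPortCheck
{
    public static void main(String[] args)
    {
        final String host = args.length > 0 ? args[0] : "131.243.89.132";
        final int port = args.length > 1 ? Integer.parseInt(args[1]) : 5064;
        try (Socket socket = new Socket())
        {
            socket.connect(new InetSocketAddress(host, port), 3_000);
            System.out.println("TCP connect to " + host + ":" + port + " succeeded");
        }
        catch (IOException ex)
        {
            System.out.println("TCP connect to " + host + ":" + port + " failed: " + ex);
        }
    }
}

If that also times out, the problem is on the network or IOC side rather than in CAJ.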

kasemir commented 3 years ago

Does the screen which causes the problem contain PVs from many different IOCs, and could there be something in the network which limits the number of connections? We recently had a somewhat related problem where network switches were configured to simply not handle more than N different devices at a time, otherwise they declared a "port security violation" and would briefly shut down...

tynanford commented 3 years ago

Does the screen which causes the problem contain PVs from many different IOCs

Yes, this is exactly correct. The main problematic screen is an IOC overview/statistics screen showing PVs from around 200 IOCs. The screen has been in use for about 6 months, and this is the first time the behavior has been easily reproducible. The switch connection limit is interesting; I can inquire about that.

Do you have a firewall which for some reason blocks 131.243.89.132:5064? Is there an error message on that IOC which would explain it?

This is an old VxWorks IOC, and we have issues with these getting into a bad state. I was able to ping the hardware but could not get a VxWorks console, and I can't get any of the PVs. Nothing in the logs either.

Taking those PVs off the screen seems to make things better, though it took a few restarts of the GUI to get all the other PVs to connect. The stats page is okay for now, and we will debug the problematic IOC.

kasemir commented 3 years ago

OK, so it starts to make some sense.

Bottom line is that there's an ill-behaving CA server, that VxWorks IOC, which replies to the UDP name search with "connect to me at TCP port ..." but then fails to respond to such a connection request. Yes, IOCs can get into that state where for example existing connections continue to be served but no new connections succeed.

The CAJ library seems to handle this poorly. The connection attempt only times out after minutes, and until then you might be stuck when trying to close PVs or create/connect other PVs. I'm not sure whether we currently have a way to configure that timeout, and it's unfortunate that it causes a lockup.

There might be ways to avoid the display freeze-on-close; I can look at that, just maybe not right now. I'm somewhat pessimistic about fixing it in the CAJ lib itself. But I doubt any of that would make a difference for the original connection problem, where the IOC simply doesn't accept another connection once it's in that irregular state.
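
For the freeze-on-close part, the general pattern would be something like the following hypothetical sketch (made-up names, not the actual display runtime code): hand the potentially blocking PV release to a background executor so the UI thread returns immediately.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the general idea only, not Phoebus's actual runtime
// code: hand the potentially blocking PV release to a background thread so the
// UI thread never waits on CAJ's channel lock when a display is closed.
public class AsyncPvRelease
{
    /** Stand-in for a PV whose close() may block while CAJ is mid-connect */
    interface PV
    {
        void close();
    }

    private static final ExecutorService CLOSER =
        Executors.newSingleThreadExecutor(runnable ->
        {
            final Thread thread = new Thread(runnable, "pv-closer");
            thread.setDaemon(true);
            return thread;
        });

    /** Called from the UI thread when a display closes; returns immediately */
    static void releaseLater(final PV pv)
    {
        CLOSER.submit(pv::close);
    }

    public static void main(String[] args) throws InterruptedException
    {
        final PV slowPv = () ->
        {
            try
            {   // Simulate a close() that blocks on the channel lock
                Thread.sleep(2_000);
            }
            catch (InterruptedException ex)
            {
                Thread.currentThread().interrupt();
            }
            System.out.println("PV closed on " + Thread.currentThread().getName());
        };

        releaseLater(slowPv);
        System.out.println("UI thread continues without blocking");

        // Only needed in this demo so the JVM waits for the background close
        CLOSER.shutdown();
        CLOSER.awaitTermination(5, TimeUnit.SECONDS);
    }
}

The blocked close would then just tie up a worker thread instead of the JavaFX Application Thread, though the underlying CAJ connect attempt would still take as long as it takes.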

tynanford commented 3 years ago

IOCs can get into that state where for example existing connections continue to be served but no new connections succeed.

That seems to be it, and it is why I didn't realize the Linux machines were having issues too. It wasn't until I closed and reopened Phoebus that they started to experience the same behavior, and after that it never worked again.

Thank you for the assistance, I appreciate it. We can close this issue, unless you want to keep it open for future investigation.

tynanford commented 3 years ago

Per discussion at the Phoebus meeting today, I was told of this similar issue from 2019 and asked to link to it: https://github.com/epics-base/jca/issues/36

tynanford commented 1 year ago

A fix for this has been merged into JCA - https://github.com/epics-base/jca/pull/74