epics-base / jca

Java Channel Access client API
https://www.javadoc.io/doc/org.epics/jca/latest/index.html

All new connections stall if IOC is not opening channels #36

Open willrogers opened 5 years ago

willrogers commented 5 years ago

We have a specific issue that prevents CS-Studio from making connections.

An IOC is responding to UDP broadcasts, but it is then not possible to open a TCP connection to it using SocketChannel.open(), which times out after some unspecified time, about 1 minute in this case. Because the code in CAConnector.java calls SocketChannel.open() three times in succession, JCA takes about 3 minutes before it fails and tries again to make a connection. If there are multiple channels from the same IOC, this effectively blocks all new connections in CS-Studio.

The least invasive way to improve this is to put a timeout on these connections, for example:

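// Needs: java.io.IOException, java.nio.channels.SocketChannel,
//        java.nio.channels.spi.SelectorProvider
SocketChannel sc = SelectorProvider.provider().openSocketChannel();
sc.configureBlocking(false);
sc.connect(address);
// Poll for completion: 20 x 50 ms = about 1 second in total
for (int i = 0; i < 20; i++) {
    if (sc.finishConnect()) {
        return sc;
    }
    Thread.sleep(50);
}
// Timed out: close the channel so the socket is not leaked, then report failure
sc.close();
throw new IOException("Timed out connecting to " + address);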

Returning the SocketChannel before it has connected is more tricky.
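
As a rough sketch of how the wait could be done without sleep-polling, the connect can be registered with a Selector and bounded by a deadline. SelectorConnect and its connect() method are hypothetical names for illustration, not part of JCA/CAJ, and the 2000 ms value in main is an arbitrary assumption.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

/** Hypothetical helper for illustration; not part of JCA/CAJ. */
class SelectorConnect {

    /**
     * Starts a non-blocking connect and waits at most timeoutMs for it to
     * complete. Returns the connected channel, still in non-blocking mode.
     * On timeout or error the channel is closed and the exception rethrown.
     */
    static SocketChannel connect(SocketAddress address, long timeoutMs) throws IOException {
        SocketChannel sc = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            sc.configureBlocking(false);
            if (!sc.connect(address)) {
                // Connect is in progress: wait for OP_CONNECT instead of sleeping
                sc.register(selector, SelectionKey.OP_CONNECT);
                long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
                while (!sc.finishConnect()) {
                    long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMs <= 0) {
                        throw new SocketTimeoutException("Timed out connecting to " + address);
                    }
                    // select(0) would block forever, so remainingMs is always > 0 here
                    selector.select(remainingMs);
                }
            }
            return sc;
        } catch (IOException | RuntimeException e) {
            sc.close(); // do not leak the socket on failure
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstration against a local listening socket only
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel sc = connect(new InetSocketAddress(loopback, server.getLocalPort()), 2000)) {
            System.out.println("connected: " + sc.isConnected());
        }
    }
}
```

This still blocks the calling thread for up to timeoutMs per attempt; it only bounds the wait and frees the socket on failure.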

kasemir commented 5 years ago

Do I understand correctly that this is about an IOC or other CA server which responds to the UDP name search with "I know that PV name, talk to me on TCP port X", but then doesn't react to connections via TCP port X? So caget on the command line also won't get any data? Is that a temporary issue, or do you need to restart the IOC?

willrogers commented 5 years ago

Yes, that's exactly right. It must be a fairly unusual case. We have had to restart this IOC each time it has got into this state.

ralphlange commented 5 years ago

This can happen, for example, if the IOC leaks resources and reaches the per-process limit for file descriptors. (Do you run iocStats? That should show the FD usage.) It can also happen if someone blocks TCP with a firewall but leaves UDP open. Even if it is not normal, that situation should be handled gracefully by the client.

willrogers commented 5 years ago

Well, this IOC was not sufficiently well to report iocStats, but I agree we need to be able to handle this on the client side.

My proposed fix still blocks for x seconds per failed connection. That raises two questions:

rjwills28 commented 1 year ago

I have been investigating this issue further as we have seen it re-occur more often. I created a small client using the JCA library to establish a connection to PV channels. I use this code to first try to connect to a PV located in a broken IOC where the TCP port has been disabled, and then try to connect to a PV on a healthy IOC.

The fundamental issue is that the JCA lib sends out a search broadcast to locate the IOC that contains the requested PV. The broken IOC then responds via UDP to say that it has the PV. The JCA lib tries to establish a TCP connection, and this fails, throwing an exception.

I'm not sure that the lack of a timeout on the connection is the issue. It is true that it tries to create the TCP connection 3 times and then throws an exception. However, it only waits a very short time between these tries, so I don't think this is the cause.

What I think is happening is that after the TCP connection fails, it continues to broadcast the search for the IOC containing the PV, and so once again it receives a response from the broken IOC saying that it has that PV. It tries again to establish a TCP connection, which fails, so the search broadcast runs again, and so on. We end up in a never-ending cycle. Indeed, you can see the exceptions printed over and over again until the program finishes (or CS-Studio gets closed).

Then we try to connect to a PV on the healthy IOC. The healthy IOC responds, and eventually the JCA lib tries to establish a TCP connection with it, but because it is busy handling the broken IOC's responses, this can take some time. In most cases, if you make the timeout on the connection to the healthy PV long enough, you do get a connection.

I feel like the solution should be to not continue searching for the PV if the TCP connection fails.

kasemir commented 1 year ago

"I feel like the solution should be to not continue searching for the PV if the TCP connection fails"

A CA server that replies to the search requests but then fails to handle the expected TCP connection is of course nasty. But eventually you will notice that all the PVs from that IOC no longer resolve, and restart that IOC.

If the clients now indeed stopped searching, they won't connect to the restarted IOC. So you'd have to restart all clients as well, which I don't think we want.

"TCP connection... fails and so the search broadcast runs again, etc. We end up in a never ending cycle"

That seems fundamentally correct: As long as a PV is not connected, the client searches. Shouldn't matter if we just started, never connected, or were connected and then got disconnected. Either way, we keep searching ... at some low search period ... and need to do that without being completely stuck when IOCs reply to UDP but then not TCP.


ralphlange commented 1 year ago

I agree with Kay that the underlying ideas are correct and intentional.

However, I would consider it a serious flaw if the JCA client gets into a loop so tight that it can't handle healthy IOCs anymore. That would mean such a "nasty" IOC could bring down some, maybe all, JCA clients in a control system, which is pretty much a DoS attack where, contrary to the usual setup, a bad server attacks clients.
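
One way to reconcile both points (keep searching, but never spin on a server that fails TCP) is to space out TCP attempts per server address. The following is only a sketch under that assumption: ConnectBackoff, its method names, and the delay values are hypothetical and do not describe what JCA actually does.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-server backoff for TCP connect attempts; not JCA's implementation. */
class ConnectBackoff {
    private static final class Entry {
        long delayMs;    // current backoff delay for this server
        long retryAtMs;  // earliest time the next TCP attempt is allowed
    }

    private final long initialDelayMs;
    private final long maxDelayMs;
    private final Map<InetSocketAddress, Entry> failed = new HashMap<>();

    ConnectBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** True if a TCP connect to this server may be attempted at time nowMs. */
    synchronized boolean mayConnect(InetSocketAddress server, long nowMs) {
        Entry e = failed.get(server);
        return e == null || nowMs >= e.retryAtMs;
    }

    /** Record a failed TCP connect: the delay doubles each time, up to maxDelayMs. */
    synchronized void onFailure(InetSocketAddress server, long nowMs) {
        Entry e = failed.computeIfAbsent(server, k -> new Entry());
        e.delayMs = (e.delayMs == 0) ? initialDelayMs : Math.min(e.delayMs * 2, maxDelayMs);
        e.retryAtMs = nowMs + e.delayMs;
    }

    /** Record a successful connect: the server is treated as healthy again. */
    synchronized void onSuccess(InetSocketAddress server) {
        failed.remove(server);
    }

    public static void main(String[] args) {
        ConnectBackoff backoff = new ConnectBackoff(1_000, 30_000);
        InetSocketAddress bad = new InetSocketAddress("10.0.0.1", 5064);     // example address
        InetSocketAddress healthy = new InetSocketAddress("10.0.0.2", 5064); // example address
        long now = System.currentTimeMillis();
        backoff.onFailure(bad, now);
        System.out.println("bad server allowed now:     " + backoff.mayConnect(bad, now));
        System.out.println("healthy server allowed now: " + backoff.mayConnect(healthy, now));
    }
}
```

Search replies from a server in backoff would simply be ignored until the retry time, so the channel stays in the search list and reconnects once the IOC is restarted, while healthy servers are never delayed.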

aawdls commented 1 year ago

I am proposing this for the codeathon next week. With this narrower scope I think it should be soluble.

rjwills28 commented 1 year ago

Thanks for the comments. I have come up with a possible solution for this specific issue and created a draft PR with the fix, see https://github.com/epics-base/jca/pull/74. Any feedback on this idea would be appreciated.

rjwills28 commented 1 year ago

We have come across this issue again when trying to connect to an IP address that does not respond correctly. The connection attempt to the 'bad' IP address blocks all other connections, so CS-Studio screens display no data for a long time. The previous fix in https://github.com/epics-base/jca/pull/74 does help, but the biggest problem is that the SocketChannel.open(address) call at https://github.com/epics-base/jca/blob/f247584babe9d92904eb763b0bd823fa89e4b04e/src/core/com/cosylab/epics/caj/impl/CAConnector.java#L205 blocks for ~120 seconds. The tryConnect() method makes this call 3 times, so it prevents any other connection from being made for ~6 minutes. With the previous fix there is a small delay before it tries the 'bad' IP again, so some connections to 'OK' IPs can succeed in between, but we still have to wait a long time for the screen to populate.

To illustrate that this is the cause of the problem, I tested changing the SocketChannel.open(address) so that I can add a timeout for the connection. See the example implementation here: https://github.com/epics-base/jca/compare/master...rjwills28:jca:socket_connect_timeout. I used a relatively small timeout for our tests but it did fix the problem as we eliminated the 6 mins of blocking time. Obviously this timeout would not work in all situations (slow network, switches etc) making this a tricky one to fix in a generic way.
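
For readers who do not follow the branch link, here is a minimal sketch of one way to bound a blocking connect: open the channel unconnected and connect through its socket adaptor, which accepts a timeout. This is not necessarily what the linked branch does. TimedConnect, open() and this tryConnect() are hypothetical names, and the values in main are arbitrary.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketAddress;
import java.nio.channels.SocketChannel;

/** Hypothetical helper for illustration; not the code in CAConnector.java. */
class TimedConnect {

    /** Like SocketChannel.open(address), but gives up after timeoutMs (must be > 0). */
    static SocketChannel open(SocketAddress address, int timeoutMs) throws IOException {
        SocketChannel sc = SocketChannel.open();       // blocking mode, not yet connected
        try {
            sc.socket().connect(address, timeoutMs);   // throws SocketTimeoutException on timeout
            return sc;
        } catch (IOException | RuntimeException e) {
            sc.close();                                // do not leak the socket
            throw e;
        }
    }

    /** Retries the connect; worst-case blocking time is attempts * timeoutMs. */
    static SocketChannel tryConnect(SocketAddress address, int attempts, int timeoutMs)
            throws IOException {
        IOException last = new IOException("No connection attempts made");
        for (int i = 0; i < attempts; i++) {
            try {
                return open(address, timeoutMs);
            } catch (IOException e) {
                last = e;                              // remember the failure and retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // Demonstration against a local listening socket only
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel sc = tryConnect(
                     new InetSocketAddress(loopback, server.getLocalPort()), 3, 2000)) {
            System.out.println("connected: " + sc.isConnected());
        }
    }
}
```

With three attempts, the worst case drops from about 3 x 120 s = 6 minutes to 3 x timeoutMs, for example 15 seconds with a 5-second timeout.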

Does anyone have any ideas/views/thoughts on how we might be able to get around this problem?

kasemir commented 1 year ago

I think the workaround is as before: We don't want to drop the channel from the search because there's hope that the IOC eventually gets fixed, and then we want to connect successfully. But hanging for 120 sec in the TCP connect call is clearly too long, and making it configurable sounds good. As you write about your test, a SOCKET_CONNECT_TIMEOUT of 100 ms indeed seems short; maybe go with something on the order of a few seconds as a default?
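
A configurable timeout with a fallback default could look roughly like the sketch below. The property name and the 5000 ms default are assumptions made for this example only; they are not taken from JCA or from any PR.

```java
/** Hypothetical configuration helper; property name and default are assumptions. */
class ConnectTimeoutConfig {
    // Made-up property name for illustration, not an actual JCA/CAJ setting
    static final String PROPERTY = "example.caj.socket_connect_timeout_ms";
    // "A few seconds" as a default; the exact value is an assumption
    static final int DEFAULT_MS = 5000;

    /** Returns the configured timeout in ms, or the default if unset or invalid. */
    static int connectTimeoutMs() {
        String raw = System.getProperty(PROPERTY);
        if (raw == null) {
            return DEFAULT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Reject 0 and negatives: Socket.connect treats 0 as "wait forever"
            return value > 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }

    public static void main(String[] args) {
        System.out.println("connect timeout [ms]: " + connectTimeoutMs());
    }
}
```

Sites with slow networks could then raise the value with -Dexample.caj.socket_connect_timeout_ms=... on the command line, while everyone else gets the short default.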

rjwills28 commented 12 months ago

Thanks for the feedback. I have opened a PR with the suggested changes, including making the connection timeout configurable. Let me know any other thoughts you have about this workaround.

kasemir commented 3 months ago

For what it's worth, the underlying situation is not limited to a broken IOC/CA server. It can also happen when a firewall is open only for UDP and TCP port 5064. For the first IOC on such a host, the CA server running on TCP 5064 can communicate fine. For the second IOC on such a host, the CA server is running on an arbitrary TCP port. It will reply to a UDP search with "Talk to me on that TCP port", but the client then cannot establish the associated TCP connection because of the firewall.
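
The port behaviour described above can be illustrated with plain sockets. This is only a sketch of the "preferred port, else any free port" pattern, not the actual CA server code; CaServerPortDemo and bindPreferred() are hypothetical names.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

/** Hypothetical illustration of preferred-port binding; not the EPICS CA server code. */
class CaServerPortDemo {
    static final int CA_DEFAULT_PORT = 5064;

    /** Binds to preferredPort if it is free, otherwise to a port chosen by the OS. */
    static ServerSocket bindPreferred(int preferredPort) throws IOException {
        try {
            return new ServerSocket(preferredPort);
        } catch (BindException inUse) {
            // Port already taken (e.g. by the first server on this host):
            // port 0 asks the OS for an arbitrary free port
            return new ServerSocket(0);
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket first = bindPreferred(CA_DEFAULT_PORT);
             ServerSocket second = bindPreferred(CA_DEFAULT_PORT)) {
            System.out.println("first server TCP port:  " + first.getLocalPort());
            System.out.println("second server TCP port: " + second.getLocalPort());
        }
    }
}
```

A firewall rule that only allows TCP 5064 covers the first server but not the second, whose port is not known in advance.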

kasemir commented 3 months ago

I think this is fixed with #78. Close?