Fix for blocking failed tcp connections

rjwills28 commented 1 year ago

This PR is a proposed fix for the issue described in #36 .

Summary of the issue: If an IOC is in an environment where it can respond to UDP broadcasts but not to TCP requests, then any subsequent TCP connection to a healthy IOC is blocked. In our use-case we found that CS-Studio (which uses the JCA library) would take a long time to display values of PVs from a healthy IOC if the screen also contained PVs from an ill IOC.

Issue: each time the ill IOC responds to the UDP broadcast, the JCA library spends time (~3 secs) trying to establish a TCP connection. This fails and so the UDP broadcast is sent out again, which gets a response and again it tries to make the TCP connection. This essentially ends up blocking any other TCP connections to healthy IOCs.

Solution: in the linked issue above we have discussed that if the TCP connection fails we can't similar stop any future connection attempts as then the client will not connect when the ill IOC is fixed. Instead I have tested adding an increasing delay between each successive TCP connection attempt to an IP address if it should fail. The assumption here is that the underlying reason that an 'ill' IOC cannot respond to the TCP request is due to a networking problem on the host server and so we can based the fix on the specific IP address. The increasing delay means that we do not tie up the TCP connection code with the unsuccessful attempts and it allows other connections to healthy IOCS to be made.

This is a relatively 'simple' fix with minimal changes and does not require a larger re-work of the connection process. Any thoughts/comments/feedback would be appreciated as to whether this is a suitable/beneficial fix.

kasemir commented 1 year ago

Fundamentally looks good to me, but I'm not sure about the concurrency so to be on the safe side I'd use a ConcurrentHashMap, and then instead of

if (map.containsKey(address)) 
  map.get(address)

I'd use

value = map.get(address);
if (value != null)
   ..

and instead of

if (contains...)
   ..get
else
  put

I'd suggest the atomic operations (computeIfAbsent etc) offered by the concurrent hash map. In the javadoc for https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html, this is basically what you want to do: freqs.computeIfAbsent(k -> new LongAdder()).increment();

rjwills28 commented 1 year ago

Thanks for the feedback and good idea. I have implemented your suggestion in my latest commit.

rjwills28 commented 1 year ago

@kasemir, @ralphlange would it be possible to get a review on this updated PR?

Note: I have a separate PR to fix the build on Java 16, which is not related to this change (see https://github.com/epics-base/jca/pull/75).

kasemir commented 1 year ago

Sorry for the delay

kasemir commented 1 year ago

There is a test failure because of some jacoco class file version, I assume that's #75

@shroffk should we merge this?

ralphlange commented 1 year ago

Looks reasonable to me.

epics-base / jca

Fix for blocking failed tcp connections #74