Closed mdavidsaver closed 4 years ago
There was a problem with 'empty PV name' messages reported by the CA server for CAJ clients. It should not be possible to create channels with an empty name, https://github.com/epics-base/jca/blob/master/src/core/com/cosylab/epics/caj/CAJContext.java#L986, but it still happened.
Don't know if that was actually the same problem (empty UDP packet) or an otherwise valid search request with empty name.
At a glance, I notice that neither flushSendBuffer()
or its callers test to see if sendBuffer
is empty.
@shroffk I think Michael has a good suggestion. We could add some check like
if (sendBuffer.limit() > 0)
context.getBroadcastTransport().send(sendBuffer);
else
new Exception("Trying to send empty UDP packet").printStackTrace();
This assumes that the ByteBuffer has been 'flipped' so it's ready to send.
Otherwise need to check position()
instead of limit()
.
I think it's important to see a stack trace to have a chance of determining how we end up in this situation. But that methods throws no exception, and I also don't see a logger, so to minimize the changes we could just print the stack trace right there.
Better would of course be throwing that exception, but that changes the method signature.
Or use java.util.logging.
Ok...we can add the print the exception at the very least to help understand Michaels error. @mdavidsaver if we put this on a branch will that work for you?
I've tried it already, and it doesn't trigger. So my guess is wrong. I tried with position()==0
as well.
I should also mention that I do see, and have seen so long that I've stopped seeing it, an error due to my firewall configuration now allowing CA UDP messages out on my wireless interface.
2019-08-21 18:43:45 WARNING [com.cosylab.epics.caj.impl.BroadcastTransport] Failed to sent a datagram to:/10.255.11.255:5064
java.net.SocketException: Operation not permitted
at java.base/sun.nio.ch.DatagramChannelImpl.send0(Native Method)
at java.base/sun.nio.ch.DatagramChannelImpl.sendFromNativeBuffer(DatagramChannelImpl.java:584)
at java.base/sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:546)
at java.base/sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:529)
at com.cosylab.epics.caj.impl.BroadcastTransport.send(BroadcastTransport.java:238)
at com.cosylab.epics.caj.impl.ChannelSearchManager.flushSendBuffer(ChannelSearchManager.java:238)
at com.cosylab.epics.caj.impl.ChannelSearchManager.access$200(ChannelSearchManager.java:20)
at com.cosylab.epics.caj.impl.ChannelSearchManager$ChannelSearchTimerTask.timeout(ChannelSearchManager.java:66)
at com.cosylab.epics.caj.util.SearchTimer$RunLoop.run(SearchTimer.java:255)
at java.base/java.lang.Thread.run(Thread.java:834)
Seeing this I tried:
try {
context.getBroadcastTransport().send(sendBuffer);
} finally {
initializeSendBuffer();
}
Which doesn't seem to make a difference.
Ok. going further up the stack I think I see it. The failed send() doesn't advance the position()
.
diff --git a/src/core/com/cosylab/epics/caj/impl/BroadcastTransport.java b/src/core/com/cosylab/epics/caj/impl/BroadcastTransport.java
index 2a1ebb9..88841a7 100644
--- a/src/core/com/cosylab/epics/caj/impl/BroadcastTransport.java
+++ b/src/core/com/cosylab/epics/caj/impl/BroadcastTransport.java
@@ -233,12 +233,13 @@ public class BroadcastTransport implements Transport, ReactorHandler {
// prepare buffer
buffer.flip();
- //context.getLogger().finest("Sending " + buffer.limit() + " bytes to " + broadcastAddresses[i] + ".");
+ context.getLogger().warning("Sending " + i + " " + buffer.limit() + " bytes to " + broadcastAddresses[i] + ".");
channel.send(buffer, broadcastAddresses[i]);
}
catch (Throwable ioex)
{
+ buffer.position(buffer.limit());
// TODO what to do here
logger.log(Level.WARNING, "Failed to sent a datagram to:" + broadcastAddresses[i], ioex);
}
The difference between these two cases is that in the first I explicitly specify addr_list to include one of my local interface broadcast addresses. In the second case I leave addr_list empty.
This being due to ordering in the broadcastAddresses
list. The manual addr_list entries appear before the automatic ones.
To me it makes sense that the explicitly specified addr_list should be used first...
What is the expected behaviour here...if someone has set the manual_addr_list, should both the manual and the automatic list be used? In the above case, if jca detected that the manual_addr_list was not set or empty and simply ignored it and moved on to the automatic addr list would that be the correct behaviour?
For the new PVA implementation, I thought it would make sense to not only use the manually added items first, but exclusively, i.e. either use the local interfaces when the addr list is empty, or use the addr list exactly as is when it's defined.
But the EPICS community has 30 years of history with EPICS_CA_AUTO_ADDR_LIST=Yes as a default, so the local interfaces are always expected. Still think they should be used last, after manually entered items. Since we're talking about UDP, though, I'm not sure you can have any expectations as to who answers "first", since you can't know which server sees a search request in what order.
specified addr_list should be used first
I'm sorry if I wasn't clear. Ordering should not make any difference. That it does is the bug.
I notice that JCA is sending zero length UDP packets with destination 5064. These seem to be sent just after a search is (or would be) sent.
Oddly, I also observe these being sent in a situation where (I think) a valid search should be sent, but isn't.
The difference between these two cases is that in the first I explicitly specify addr_list to include one of my local interface broadcast addresses. In the second case I leave addr_list empty.
vs.
In the second case it seems that no searches are sent at all, so PVs are never connected (which is why I was looking).
My
192.168.210.1/24
interface is both UP and RUNNING, so I think that it should be included in the auto address list. It is not the default route (packets set to 255.255.255.255 go elsewhere). Given that the zero length packets are being sent, it is being found somehow.I wonder if these aren't two separate problems with the searching process.
This is with a current build of phoebus master branch (and whatever jca build it pulls in).