Closed by lpetre-ulb 4 years ago
Also @mexanick, I was wondering why you implemented this commit? If it was because connections were impossible once the maximum number of clients was reached, fixing this issue should have the same effect.
I would tend to agree. As @lpetre-ulb is going to be submitting a PR upstream, we should try to keep the version we are using as vanilla as possible, especially if the patch put upstream solves these (old) issues.
One question/request for @lpetre-ulb: can you do a quick investigation of a `uhal` call that sends multiple (~10-100) transactions within a single `dispatch`? I have seen that when more than 5-10 requests are bundled together, the CTP7 IPBus server seems to choke. If you like, I can probably dig up the script I was using to test this. Jes had suggested doing some network traffic packet analysis, which is probably similar to how you investigated this issue, and if possible/simple, it would be good to get a fix for that included in any upstream PR.
Yes, of course. Is there a specific metric I should look at? Transactions/second, words/second, packets/second, latency, ...? Also, does the type of transaction matter?
If the `ipbus` daemon is overloaded with dead TCP connections and the IPBus packet is bigger than the `recv` buffer (128 bytes, roughly 15 single read/write transactions), it is possible that the latency would be greatly increased.
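To illustrate the buffer-size point, here is a toy Python sketch (the CTP7 daemon itself is C++, but the TCP semantics are the same; the buffer size and function name are illustrative only): since TCP is a byte stream, a request larger than the `recv` buffer arrives across several `recv()` calls and has to be reassembled, so a server that assumes one `recv()` per packet would truncate or stall on larger dispatches.

```python
# Toy sketch, not the CTP7 daemon code: reassembling a request that is
# larger than a fixed-size recv() buffer. TCP is a byte stream, so one
# IPBus packet may span several recv() calls.

RECV_BUF = 128  # hypothetical buffer size, taken from the estimate above

def read_exact(sock, nbytes):
    """Read exactly nbytes from a connected socket, however recv() chunks them."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        data = sock.recv(min(RECV_BUF, remaining))
        if not data:  # empty result: the peer closed the connection
            raise ConnectionError("peer closed mid-packet")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```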
I don't remember if I tried playing with the `recv` buffer, because that definitely sounds like the type of limitation that was being hit...
The main thing is that I could bundle 1000s of transactions into a single `dispatch()` when communicating with a GLIB, but on a CTP7 this always errored when the number of transactions was somewhere between 5 and 20.
The most basic test would be something like:
```python
import uhal

# set up hwdevice as a uhal device beforehand
reglist = ["reg1", "reg2", "reg3"]  # registers to read; use more than 10
regvals = [hwdevice.getNode(r).read() for r in reglist]
hwdevice.dispatch()
for k, v in zip(reglist, regvals):
    print("{:s}: 0x{:08x}".format(k, v.value()))
```
You can extend this to mix reads and writes in the same dispatch:
```python
import random

reglist = ["reg1", "reg2", "reg3"]  # registers to exercise; use more than 10
wvals = [random.randint(0x0, 0xffffffff) for _ in reglist]
regvals = []
for w, r in zip(wvals, reglist):
    hwdevice.getNode(r).write(w)
    regvals.append(hwdevice.getNode(r).read())
hwdevice.dispatch()
for k, v, w in zip(reglist, regvals, wvals):
    print("{:s}: 0x{:08x} (expected 0x{:08x})".format(k, v.value(), w))
```
This could even be done for just a single register, by replacing `reglist` with a single register that you then read/write multiple times in succession.
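A minimal sketch of that single-register variant, reusing the `hwdevice` setup from the examples above (the register name and repetition count are placeholders):

```python
# Single-register variant: read one register N times inside one dispatch()
# to probe where the CTP7 server starts to choke.
N = 50
reg = "SOME.REGISTER.NAME"  # placeholder register name
vals = [hwdevice.getNode(reg).read() for _ in range(N)]
hwdevice.dispatch()
for i, v in enumerate(vals):
    print("read {:d}: 0x{:08x}".format(i, v.value()))
```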
This is what the scripts I was using (back during the slice test to debug a specific FW issue) were doing, and they can now be (temporarily) found here (`uhal`) and here (`rwreg`).
I don't want you to put too much (of your valuable) time into understanding the performance differences between `uhal` and native `memsvc` `read`/`write`, as this ship, I think, has sailed... though I am still interested, so if you can find why the "multiple dispatch" doesn't work, I may try to continue some performance testing on my own, or create a task for a newcomer.
The development branch has moved to a templated, RPC-only solution.
Brief summary of issue
After killing the `ipbus` daemon on the CTP7, the program cannot be restarted immediately; one has to wait some time. During that period the `bind` syscall fails with the "address in use" error. The `netstat` tool shows TCP connections in the `LAST_ACK` state, which progressively time out.
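As an aside, a standard mitigation for the restart symptom (the `bind` failure, not the leaked connections themselves) is to set `SO_REUSEADDR` on the listening socket, so a restarted daemon can bind while the old connections drain. A minimal Python sketch, with a hypothetical port number:

```python
# Sketch of the usual SO_REUSEADDR mitigation: lets a restarted server
# bind() while old connections are still in LAST_ACK/TIME_WAIT. This only
# works around the restart symptom; it does not fix the leaked sockets.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 60002))  # hypothetical port for the ipbus service
srv.listen(5)
```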
Types of issue
Expected Behavior
The `ipbus` daemon should not keep TCP connections open once the client disconnects.
Current Behavior
While the daemon is running, closed TCP connections are blocked in the `CLOSE_WAIT` state, and the `ipbus` daemon consumes one full CTP7 CPU core after the first TCP connection is closed by the client.
Steps to Reproduce (for bugs)
1. Run the `ipbus` daemon on the CTP7
2. Connect with a client (`testConnectivity.py` for example)
3. Observe the CPU usage (`top`)
4. Look at the TCP connections with `netstat -an`; when writing this issue, 377 such connections are open on `eagle23` (used for QC7)

Possible Solution (for bugs)
When the `recv` call returns 0, the socket must be `close`d: https://github.com/cms-gem-daq-project/xhal/blob/d1f8b3d8efa7693418773a923f4c2388129130fb/xcompile/ipbus/Client.cpp#L51-L55
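For illustration, here are the semantics of the proposed fix in a Python sketch (the real patch is in the C++ `Client.cpp` linked above; names like `handle_request` are placeholders): a `recv()` that returns zero bytes signals an orderly shutdown by the client, so the server must `close()` its end and stop polling the descriptor, otherwise the socket lingers in `CLOSE_WAIT` and the event loop spins.

```python
# Sketch only: a select()-based server that closes client sockets when
# recv() returns 0, so they do not accumulate in CLOSE_WAIT.
import select
import socket

def handle_request(sock, data):
    sock.sendall(data)  # stand-in for the real IPBus transaction handling

def serve(listen_sock):
    clients = []
    while True:
        readable, _, _ = select.select([listen_sock] + clients, [], [])
        for sock in readable:
            if sock is listen_sock:
                conn, _ = sock.accept()
                clients.append(conn)
                continue
            data = sock.recv(128)
            if not data:              # recv() returned 0: client disconnected
                clients.remove(sock)  # stop select()ing on the dead socket...
                sock.close()          # ...and release it from CLOSE_WAIT
                continue
            handle_request(sock, data)
```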
Your Environment