art-daq / artdaq

Other
0 stars 3 forks source link

Add acknowledgements to Request protocol #136

Closed eflumerf closed 2 years ago

eflumerf commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/22532 (FNAL account required) Originally created by @eflumerf on 2019-05-08 02:19:22


There should be an option in RequestReceiver and support in RequestSender for acknowledgements of requests (instead of the RequestSender owner informing RequestSender when a given sequence ID is complete). This would help make the Request protocol have the same reliability as the Routing Table protocol.


Related issues:


Related issues:

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-05-15 18:45:50


I started looking at this, and some associated changes to the Request messages, and realized that to fully implement this, RequestSender is going to have to know what ranks to expect acknowledgements from. There are several strategies for doing this, but I would like to make sure that the solution we implement will work for DUNE, where the sending ranks may come and go, and/or each request may have a different subset of senders.

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-05-15 18:49:24


I've put my initial code changes on artdaq:feature/22532_RequestAcknowledgements

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-05-30 17:02:19


There is a feature in artdaq_core needed for this: artdaq-core:feature/22532_TimeUtils_ElapsedTimeFromTimespec.

I have tested this using unit tests and a set of simple requestSender/requestReceiver tests.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-06-06 17:01:51


Hi Eric, I have some questions about the RequestMessage Constructor (https://cdcvs.fnal.gov/redmine/projects/artdaq/repository/entry/artdaq/DAQrate/detail/RequestMessage.hh?rev=feature%2F22532_RequestAcknowledgements#L466)...

The first thing that I noticed was that when I tried to compile the code on this branch, is that the compiler complained about unused variables "size" and "end" in the RequestMessage Constructor. I see that these variable are used in "assert" statements, and I'm guessing that the assert statements are compiled out in profile builds, so those variable end up not being used.

A few questions:

  1. If we add a TLOG(TLVL_ERROR) statement when the size that is passed in is too small, is that enough? Should we return early? Throw an exception?
  2. is the loop on lines 480-485 missing an increment of the "ptr" variable to point to the next packet?
  3. if we add a simple "if" statement check on the values for ptr and end (in addition to the assert on line 482), I presume that a TLOG(TLVL_ERROR) message would be appropriate. What else? Skip the creation of any remaining RequestPacket instances?

Thanks, Kurt

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-06-06 17:18:27


  1. A TLOG would probably be okay, as well as returning a marked-as-invalid packet.
  2. No, ptr is incremented in the RequestPacket constructor. This is bad code and should probably be changed (or at least thoroughly documented).
  3. Yes, this would be at the very least a warning about mismatched received message size and contents. Once the end has been reached, any further accesses would be a segfault, so it should abort in that case.
eflumerf commented 2 years ago

Comment by @eflumerf on 2019-06-06 17:26:05


I just remembered that the reason for the bad code in 2 is because the ptr is not incremented by sizeof(RequestPacket) due to an extra word being inserted into the serialized object to represent the number of bytes in the VectorBitset...

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-06-07 15:32:53


Eric, I've committed a new version of RequestMessage.hh that has candidate changes for the issues that we've discussed. In addition, I created a RequestMessage_t.cc unit test file to (hopefully) test the changes that I've made.

I was hoping to keep the assert statements in the RequestMessage constructor, but I didn't know how to avoid having them confuse the unit tests, so in the end, I took them out. And as we discussed, there are now TRACE error messages and appropriate internal settings that (again hopefully) accurately represent the internal state of the RequestMessage. The unit tests help to verify that.

For the incrementing of the "ptr" variable, I added an attribute to RequestPacket that will provide the size of the de-serialized message. And, I used that in decoding the full RequestMessage. This is to take into account your point that a simple sizeof(blah) call would not be sufficient for incrementing "ptr" in the RequestMessage constructor.

Please take a look, when you get a chance.
With these changes, all unit tests pass on the feature/22532_RequestAcknowledgements branch. Thanks, Kurt

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-06-07 16:09:38


I have reviewed the added code and run the new test. I have some ideas for further unit tests, which I will add to a separate Issue.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-06-14 16:47:06


Hi Eric, Are there configuration changes that should be made to keep up with these code change(s)? I suspect not, but I wanted to ask. In an incremental test of switching a routing-master-based system to use this new code (actually to use the merged working/22535_protoDFO_testing branch), the push-mode BR complains that it doesn't have routing information. Looking at a pull-mode BR, I don't see table updates being received in the TRACE log. I'm sure that I'm missing something, so any hints that you have would be appreciated.
To be clear, the only change that I've made to a config that works with artdaq v3_05_00 is to move the routing_token_port parameter in the RM config into a token_receiver block. As I say, there may be other changes that I need to make, but I'm forgetting them. Thanks, Kurt

eflumerf commented 2 years ago

Comment by @eflumerf on 2019-06-14 17:01:00


IIRC, the only new configuration parameters have default values that make the system perform as it did before the change. Do RequestSender_t and requestsender -c test/DAQrate/requestsender.fcl work as expected?

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-06-25 21:02:24


RequestSender_t was working as expected.

I found the problem... I needed to merge bugfix/22267_RoundRobin_MinParticipantsFix into the working/22535_protoDFO_pduneHacks branch. So, it was completely unrelated to this Issue.