DUNE-DAQ / minidaqapp


Add software TP generation and TP->TA->TC chain in trigger #70

Closed: philiprodrigues closed this PR 3 years ago

philiprodrigues commented 3 years ago

This PR adds software trigger primitive (TP) generation to the readout app, and reception of those TPs in the trigger app. The trigger app sends the TPs through a TP -> trigger activity (TA) -> trigger candidate (TC) chain, which is fed into the module-level trigger (MLT).

The PR includes buffering and data requests for the "raw" TPs from the readout app, but no buffering of TPs in the trigger app.
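
For orientation, here is a minimal Python sketch of what the TP -> TA -> TC chain does conceptually. The class names and the trivial one-to-one grouping are illustrative assumptions only, not the actual trigger-repo classes or algorithms:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TP:                      # trigger primitive: one hit on one channel
        channel: int
        time_start: int
        time_over_threshold: int

    @dataclass
    class TA:                      # trigger activity: a cluster of TPs
        tps: List[TP] = field(default_factory=list)

    @dataclass
    class TC:                      # trigger candidate: a cluster of TAs
        tas: List[TA] = field(default_factory=list)

    def make_tas(tps: List[TP]) -> List[TA]:
        # Trivial one-TP-per-TA grouping, standing in for a real algorithm
        return [TA(tps=[tp]) for tp in tps]

    def make_tcs(tas: List[TA]) -> List[TC]:
        # Trivial one-TA-per-TC grouping, standing in for a real algorithm
        return [TC(tas=[ta]) for ta in tas]

    # The MLT consumes TCs and issues trigger decisions, i.e. TriggerRecords
    tcs = make_tcs(make_tas([TP(channel=0, time_start=0, time_over_threshold=1)]))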

The default frames.bin file won't give enough TPs for any useful tests. Better results will be obtained with the larger file mentioned here:

https://github.com/DUNE-DAQ/readout/#enabling-the-software-tpg

For my tests I used:

python -m minidaqapp.nanorc.mdapp_multiru_gen -n 2 -s 10 -t 1 --host-ru localhost --data-file felix-2020-06-02-093338.0.0.0.bin --enable-software-tpg json

The number of triggers reported in the trigger app log is sensible, and matches the number of TRs I see in the output file. Each TR has four fragments: two from the two raw-data outputs (which have the same size across TRs) and two from the TPs (whose sizes differ across TRs).

bieryAtFnal commented 3 years ago

With multiple runs in a single DAQ session, it seems that the RawTP Fragments do not appear in the data files from the second and third runs.

Here are the commands that I used to generate the configuration and to run the system:

To check on the number of fragments, I used our hdf5_dump tool:

Here is the sample output from one of my runs:

Reading file swtest_run000103_0000_biery_20210802T163604.hdf5
---------------------Column Definitions---------------------
i:           Trigger record number;
N_frag_exp:  expected no. of fragments stored in header;
N_frag_act:  no. of fragments written in trigger record;
N_diff:      N_frag_act - N_frag_exp
---------------------Column Definitions---------------------
    i       N_frag_exp     N_frag_act     N_diff  
    1            4              2           -2    
    2            4              2           -2    
    3            4              2           -2    
    4            4              2           -2    
    5            4              2           -2    
    6            4              2           -2    
    7            4              2           -2    
    8            4              2           -2    
    9            4              2           -2    
    10           4              2           -2    

(The N_diff values should be zero.)

floriangroetschla commented 3 years ago

I added a new, smaller (1 MB) frames.bin that produces continuous hits and can be used for testing. The link in the minidaqapp wiki for the 2.8 setup has been updated. While testing I ran into similar problems to Kurt's, although I can also reproduce them within the same run by pausing and resuming again: after resume is called a second time, only fake triggers are issued.

philiprodrigues commented 3 years ago

I fixed some problems in DS (data selection) with handling pause/resume and stop/start, in https://github.com/DUNE-DAQ/trigger/pull/74 . I expect that will fix some of the issues described in this PR. If you want to try that branch now, that would be great; otherwise you can wait until https://github.com/DUNE-DAQ/trigger/pull/74 is merged. You'll also want the two commits I just pushed to the branch in this PR.

bieryAtFnal commented 3 years ago

I tried the latest commits on this branch (minidaqapp::benland100/tp_alg_chain) along with the changes in the trigger repo (branch == philiprodrigues/stop-start) and the latest commits to the develop branch in the readout repo (commit hash e5df7578), and I see that three files created by 3 sequential runs in a single DAQ session all have the same size. So, it would seem that the 2nd-run problem is fixed.

Something that I hadn't fully appreciated before this current set of tests is that there are additional triggers (and therefore additional TriggerRecords) when TPs are enabled.

When TPs are enabled, the TPC/WIB Fragment sizes are typically 544 bytes in the non-periodic TriggerRecords (compared to 37200 in the periodic TriggerRecords). However, I see occasional Fragment sizes of 1008 bytes for WIB Fragments in the non-periodic TRs. Are these two values expected (544 usually, but 1008 sometimes)?

bieryAtFnal commented 3 years ago

Trying a slightly more stressful system configuration (3 Readout Apps), I see a problem on the third run in a DAQ session.

Here are the commands:

In this test, nanorc complains about "Timeout while waiting for a reply from ruemu2 for command stop" at the end of the third run. I've also seen the problem happen in the second run (rather than the third).

In the failing run, the DF TriggerRecordBuilder complains about Fragments from the previous run. I haven't checked to see which module or queue might be delivering those Fragments "too late"...

philiprodrigues commented 3 years ago

> Something that I hadn't fully appreciated before this current set of tests is that there are additional triggers (and therefore additional TriggerRecords) when TPs are enabled.

If you need fewer (or more) triggers-from-TPs, you can adjust the prescale with -A 'dict(prescale=NUMBER)' (larger for fewer triggers). You'll still get at least one trigger, because the prescale is coded up such that the first TP (and TA) are chosen. If that's a problem, it would be straightforward to change.
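
To illustrate the "first TP always chosen" behaviour, here is a minimal sketch of prescale logic under that assumption; this is illustrative only, not the actual trigger-repo code:

    class Prescaler:
        """Pass every Nth input; the counter starts so the first input passes."""
        def __init__(self, prescale: int):
            self.prescale = prescale
            self.count = 0

        def passes(self) -> bool:
            keep = (self.count % self.prescale) == 0
            self.count += 1
            return keep

    p = Prescaler(prescale=100)
    kept = [i for i in range(1000) if p.passes()]
    print(kept)   # [0, 100, 200, ...]: larger prescale -> fewer triggers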

> When TPs are enabled, the TPC/WIB Fragment sizes are typically 544 bytes in the non-periodic TriggerRecords (compared to 37200 in the periodic TriggerRecords). However, I see occasional Fragment sizes of 1008 bytes for WIB Fragments in the non-periodic TRs. Are these two values expected (544 usually, but 1008 sometimes)?

544 == 80 + 464 and 1008 == 80 + 2*464, where 80 is the fragment header size and 464 is the size of one WIB frame, so these are the appropriate fragment sizes for triggers containing one or two WIB frames. I looked at the prescale code, and the window it requests is the width of the TP that was selected by the prescale, so narrow windows are consistent with that. But there aren't many hits that are just one tick long: the readout code is missing a conversion between TPC ticks and timestamps (https://github.com/DUNE-DAQ/readout/issues/116). When that is fixed, the fragments will still have sizes of 80 + n * 464, but with n more typically in the region of several tens or hundreds.
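
Spelled out as a quick sanity check (the 80- and 464-byte constants come straight from the numbers above; note that 37200, the periodic-TR size, corresponds to 80 WIB frames):

    FRAGMENT_HEADER_BYTES = 80    # size of the Fragment header
    WIB_FRAME_BYTES = 464         # size of one WIB frame

    def wib_fragment_size(n_frames: int) -> int:
        return FRAGMENT_HEADER_BYTES + n_frames * WIB_FRAME_BYTES

    assert wib_fragment_size(1) == 544     # the typical non-periodic size
    assert wib_fragment_size(2) == 1008    # the occasional non-periodic size
    assert wib_fragment_size(80) == 37200  # the periodic-TR size seen earlier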

If it's more useful to have the prescale trigger produce fixed-width windows, we can discuss that.

philiprodrigues commented 3 years ago

https://github.com/DUNE-DAQ/trigger/pull/74 is merged, so the develop branch of trigger should now be good for testing this PR.

bieryAtFnal commented 3 years ago

With the latest readout, trigger, etc. code, I still see problems with multiple Readout Apps when SWTPGs are enabled. Moreover, I noticed some problems at the end of a single run, so maybe we should focus on that first.

The symptom is that the last several TriggerRecords in the run are missing fragments. (sample commands below)

Some questions (probably for @floriangroetschla ):

Here are the commands that I used to demonstrate the issue:

curl -o frames.bin -O https://cernbox.cern.ch/index.php/s/7qNnuxD8igDOVJT/download
python -m minidaqapp.nanorc.mdapp_multiru_gen --host-ru localhost --host-ru localhost --host-ru localhost -d $PWD/frames.bin -o . -s 10 --enable-software-tpg mdapp_6proc_swtpg
nanorc mdapp_6proc_swtpg boot init conf start 301 resume wait 30 stop wait 2 scrap terminate
tonM -n DefaultRequestHandlerModel DEBUG+11
tonM -n EmptyFragmentRequestHandlerModel DEBUG+11
nanorc mdapp_6proc_swtpg boot init conf start 401 resume wait 30 stop wait 2 scrap terminate
$DFMODULES_FQ_DIR/dfmodules/bin/hdf5dump/hdf5_dump.py -c -f swtest_run000401*.hdf5
tshow | tdelta -ct 1 | grep 'number 530' | wc -l
tshow | tdelta -ct 1 | grep 'number 540' | wc -l

The trigger numbers in the last two commands may need to be tweaked, depending on how many TRs there were in the run...
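
As a convenience, here is a sketch of a small script (check_ndiff.py is a hypothetical name) that scans the four-column hdf5_dump -c output shown earlier and reports trigger records with missing fragments:

    import re
    import sys

    def missing_fragment_records(dump_text: str):
        """Yield (trigger record number, N_diff) for rows where N_diff != 0."""
        row = re.compile(r"\s*(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s*$")
        for line in dump_text.splitlines():
            m = row.match(line)
            if m and int(m.group(4)) != 0:
                yield int(m.group(1)), int(m.group(4))

    # Usage: hdf5_dump.py -c -f swtest_run000401*.hdf5 | python check_ndiff.py
    for tr, diff in missing_fragment_records(sys.stdin.read()):
        print(f"TR {tr}: N_diff = {diff}")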

mroda88 commented 3 years ago

Hi all, I did some tests. I'm going to post the plots of the metrics here, and then I'll try to draw some conclusions.

[Two metrics-dashboard screenshots from 2021-08-04, 18:29 and 18:30]

About the fragments: I also see the missing fragments, and they always happen at the stop. Normally I would not be worried, since we have seen this happen before and it's due to the propagation of the commands, but there is definitely something more happening here. Some of the fragments missing from the previous run appear in the second run, and they are correctly counted as Unexpected, but there are fewer of them than were lost from the previous run, i.e. the red line in the bottom-center plot is never as high as the yellow one. Previously, all of them showed up as unexpected in the next run.

In addition, it seems that some requests from some links are not satisfied: as in the second plot, there is always at least one link handler with 0 satisfied requests. I need to change the dashboard to show each link handler separately; then I'll be able to give more details.

Furthermore, the event-building time skyrockets at the end. That is probably related to the missing fragments, which are spread across the run, so the missing fragments are definitely not only stop-related.

Second-run issues (maybe "issue" is too strong a word here): the data-request widths are not constant now. At the beginning of the second run they were twice as large as in the previous case, and they tend to stay higher than in the previous run (first picture, bottom-left plot). Is it just a coincidence, or do we think there is more going on here?

mroda88 commented 3 years ago

Quick update: the unsatisfied requests from the data link handler are not from my run; they are from another partition. Sorry about that. Still, a few things remain consistent: not all the fragments lost from a previous run are recovered. Also, the behaviour of the requests is not the same when a new run starts: the request width increases, and the response time increases as well.

mroda88 commented 3 years ago

Maybe we can test with a single token to better understand what happens. I'll try tomorrow; it's a bit late here now.

mroda88 commented 3 years ago

Ah, I thought it was a coincidence, but I now notice it always happens: the third run is always short, and then it stops completely as all the triggers are inhibited.

roland-sipos commented 3 years ago

The stop sequence is causing all the problems, and I'm not sure that the readout can do much about it. We receive STOP later than the TRB does, and then if those requests are "waiting" or queued, what do we do with them? Do we drop them? Do we send empty fragments? It doesn't matter, as it seems you will receive them on the next run...

What does the TRB do when the stop is issued? Does it wait for pending TRs to be finished or not?

We should discuss this on the call today.

roland-sipos commented 3 years ago

Just out of curiosity, @mroda88: during your tests, did you do start -> resume -> pause -> stop -> start, or did you go directly to stop, like start -> resume -> stop -> start?

mroda88 commented 3 years ago

Hi all. I did the test with the single token: the run is perfectly stable, so no fragments are lost during the run; it's only the stop transition that shows the issue, which I believe is a big relief. @roland-sipos, I tried both; the fragments are only missing when I skip the pause, which is OK.

To reply to your previous question: the TRB does not wait to complete the TRs. It simply drains the queues and tries to complete the TRs with the fragments available in the queues; when those are empty, the remaining incomplete TRs are sent for writing. My personal view is that missing fragments in the file at the end of a run are somewhat unavoidable in a distributed system. I think Kurt's concern is really a matter of logging: even if fragments are lost, it would be nice to be sure of where each fragment was lost.
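
In pseudocode, the stop behaviour just described might look like the following; this is an illustrative sketch under the assumptions above, not the actual TriggerRecordBuilder code:

    def trb_stop(fragment_queues, pending_trs, write_tr):
        # Drain whatever fragments are already queued and attach them to
        # their trigger records...
        for queue in fragment_queues:
            while not queue.empty():
                frag = queue.get()
                tr = pending_trs.get(frag.trigger_number)
                if tr is not None:
                    tr.add_fragment(frag)
        # ...then send every remaining TR to writing, complete or not.
        for tr in pending_trs.values():
            write_tr(tr)   # incomplete TRs end up with missing fragments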

Last but not least, here is what I believe is the big problem: regardless of whether pause is used, the second run always shows a slower building time, a slower trigger rate, and larger data-request widths, and the third run always stops almost immediately, as no fragments are sent out and any attempt to call stop times out on the ruemu apps.

mroda88 commented 3 years ago

Actually, a correction about the third-run failure. The second run, with or without pause, shows different behaviour in the responses and in the requests than the very first run. In this condition, if I do pause -> stop -> start -> resume, I obtain a valid run with the same behaviour as a "second" run: larger request widths and slower replies. I can keep going like this; I reached a fifth run (I didn't try more). But as soon as I stop without the pause and try to start again, the run halts very soon and no fragments are sent out anymore.

floriangroetschla commented 3 years ago

I pushed some changes to the develop branch of readout that change the stop behavior: for all requests that are still waiting, we now send out empty fragments. It is still possible that new requests arrive after readout is stopped, though. The warning about timed-out requests came from the EmptyFragmentRequestHandler that the TP DLHs use to always send empty fragments; I removed the warning because the requests were not actually timing out and the message was confusing.
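
Conceptually, the new stop behaviour looks something like this sketch (the actual change is in the readout C++ code; the names here are illustrative assumptions):

    def on_stop(pending_requests, send_fragment, make_empty_fragment):
        # No more data will arrive for requests still pending at stop, so
        # answer each one with an empty fragment rather than letting the
        # TRB wait for a reply that never comes.
        while pending_requests:
            request = pending_requests.pop()
            send_fragment(make_empty_fragment(request))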

I also see that the trigger rate is lower starting with the second run; my changes didn't seem to affect this issue.

mroda88 commented 3 years ago

@floriangroetschla thanks. I can give it a try later; I have another meeting now. Hopefully I'll manage to do it before today's meeting.

mroda88 commented 3 years ago

@floriangroetschla good news: I don't know exactly what you did, but with your fix I haven't managed to get into the "third-run" condition, where no fragments are generated and the run is stuck. As you say, the second run has the same features as before. Some fragments are still missing at stop, as before, but for now I'm not too concerned about this.

bieryAtFnal commented 3 years ago

We believe that the changes on this branch in this repository are ready to go. There have been changes in C++ code to fix some issues, and those fixes may continue for a little while, but we feel that the config changes are now stable.

We suspect that some of the problems that we saw in testing were the result of multiple people running on the same computers. And we feel that the recommended practice of always pausing a run before stopping it helps the system perform well.