DUNE-DAQ / dfmodules

Dataflow applications

Dataflow application crashes when configured with multiple data writers #246

Closed glehmannmiotto closed 1 year ago

glehmannmiotto commented 1 year ago

The dataflow app often crashes during running when configured with multiple data writers and the performance is being pushed. Most probably this is caused by a race condition in one of the libraries used by the data writer module (not thread-safe?).

jcfreeman2 commented 1 year ago

Sounds like https://github.com/DUNE-DAQ/hdf5libs/issues/45. I haven't been able to recreate the crash (I was successfully writing to two disks on np04-srv-001, as when Eric ran into the problem), but I haven't yet pushed performance.

jcfreeman2 commented 1 year ago

One thing that may be an issue: even though a single dataflow process writing data to multiple distinct files should logically be thread safe, this may not be something that can be done safely in hdf5 (see https://www.youtube.com/watch?v=ydQNFt-NSvg, starting at 20:03). The good news is there's an option to build hdf5 in a thread-safe manner; the bad news is we're not currently using it:

hdf5@1.12.0%gcc@12.1.0~cxx~fortran~hl~ipo~java~mpi+shared~szip~threadsafe+tools api=default build_type=RelWithDebInfo arch=linux-scientific7-broadwell

is the Spack spec of hdf5 when I set up our environment (note ~threadsafe, i.e., the thread-safe variant is disabled).
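
For concreteness, here's a minimal sketch (illustrative only, not DUNE-DAQ code; file names, group names, and record counts are made up) of the pattern the data writers effectively follow: each thread owns its own file, yet the hdf5 calls still go through library-global state (ID tables, error stacks, metadata caches), which a ~threadsafe build leaves unprotected:

```cpp
// race_sketch.cpp -- illustrative only; linked against a ~threadsafe hdf5
// this kind of workload can crash intermittently.
// Assumed build command: g++ race_sketch.cpp -lhdf5 -pthread
#include <hdf5.h>
#include <string>
#include <thread>
#include <vector>

// Each writer owns its own file, mimicking one data writer per output path.
void write_records(const std::string& filename, int n_records) {
  hid_t file = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  for (int i = 0; i < n_records; ++i) {
    // H5Gcreate2 touches library-global ID and metadata structures;
    // without +threadsafe, nothing protects them across threads.
    std::string group = "/TriggerRecord" + std::to_string(i);
    hid_t grp = H5Gcreate2(file, group.c_str(), H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Gclose(grp);
  }
  H5Fclose(file);
}

int main() {
  // Two concurrent writers, two distinct files: logically independent,
  // but still racing inside a non-threadsafe libhdf5.
  std::vector<std::thread> writers;
  writers.emplace_back(write_records, "output1.h5", 10000);
  writers.emplace_back(write_records, "output2.h5", 10000);
  for (auto& t : writers) t.join();
  return 0;
}
```

A +threadsafe build serializes hdf5 API calls behind a global lock, which removes the race at the cost of some concurrency.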

glehmannmiotto commented 1 year ago

I'm going to re-test it now that @dingp built the highfive library in thread safe mode.

jcfreeman2 commented 1 year ago

Just tried recreating runs 17971 and 17973 using /nfs/home/np04daq/DAQ_NP04_HD_DEV_AREA/configurations/np04_daq_TwoOutputDisk_TPC_conf. Note that I had to regenerate this configuration since it was deleted at some point in the last two weeks, but I used the same candidate release as during integration week, rc-v3.2.1-1. This time, the writing went fine; runs 18129, 18130, and 18131 all successfully wrote two separate hdf5 files with a single dataflow app on np04-srv-001, to /data1 and /data2.

Since "absence of evidence doesn't imply evidence of absence", this doesn't necessarily mean the problem is solved. We could run the system harder (e.g., two dataflow apps writing out four hdf5 files) to try to create a crash. If and when we generate a crash then we can try out rc-v3.2.1-2 and see if that fixes the problem.

bieryAtFnal commented 1 year ago

I was able to see DF App crashes on mu2edaq13 with the following files and commands...

(dbt) [biery@mu2edaq13 rundir]$ cat hw_map.txt
# DRO_SourceID DetLink DetSlot DetCrate DetID DRO_Host DRO_Card DRO_SLR DRO_Link
0 0 0 1 3 localhost 0 0 0
1 1 0 1 3 localhost 0 0 1
(dbt) [biery@mu2edaq13 rundir]$ 
(dbt) [biery@mu2edaq13 rundir]$ cat daqconf_wib2_2df_swtpg_tpsw.json
{ "dataflow": {     "apps": [         { "app_name": "dataflow0", "output_paths": [".", "."] },         { "app_name": "dataflow1", "output_paths": [".", "."] }     ] }, "readout": { "data_file": "./wib2-frames.bin",    "clock_speed_hz": 62500000, "enable_software_tpg": true,   "data_rate_slowdown_factor": 10,   "readout_sends_tp_fragments": false }, "trigger": {   "enable_tpset_writing": true,   "trigger_activity_config": {"prescale":1000},   "trigger_window_before_ticks": 10000,   "trigger_window_after_ticks": 10000,   "trigger_rate_hz": 11.0 } }
(dbt) [biery@mu2edaq13 rundir]$ 
(dbt) [biery@mu2edaq13 rundir]$ daqconf_multiru_gen -c ./daqconf_wib2_2df_swtpg_tpsw.json --hardware-map-file ./hw_map.txt mdapp_wib2_1x2_swtpg_tpsw
[16:16:57] Parsing config json file ./daqconf_wib2_2df_swtpg_tpsw.json                                                   config_file.py:41
[16:16:58] Loading dataflow config generator                                                                        daqconf_multiru_gen:96
[16:17:01] Loading readout config generator                                                                        daqconf_multiru_gen:101
[16:17:04] Loading trigger config generator                                                                        daqconf_multiru_gen:103
[16:17:06] Loading DFO config generator                                                                            daqconf_multiru_gen:105
           Loading hsi config generator                                                                            daqconf_multiru_gen:107
[16:17:08] Loading fake hsi config generator                                                                       daqconf_multiru_gen:109
[16:17:09] Loading timing partition controller config generator                                                    daqconf_multiru_gen:111
[16:17:10] Loading DPDK sender config generator                                                                    daqconf_multiru_gen:113
           Loading TPWriter config generator                                                                       daqconf_multiru_gen:116
[16:17:12] Parsing dataflow app config {'app_name': 'dataflow0', 'token_count': 10, 'output_paths': ['.', '.'],    daqconf_multiru_gen:130
           'host_df': 'localhost', 'max_file_size': 4294967296, 'data_store_mode': 'all-per-file',                                        
           'max_trigger_record_window': 0}                                                                                                
           Parsing dataflow app config {'app_name': 'dataflow1', 'token_count': 10, 'output_paths': ['.', '.'],    daqconf_multiru_gen:130
           'host_df': 'localhost', 'max_file_size': 4294967296, 'data_store_mode': 'all-per-file',                                        
           'max_trigger_record_window': 0}                                                                                                
           Generating configs for hosts trigger=localhost DFO=localhost dataflow=['localhost', 'localhost']        daqconf_multiru_gen:158
           hsi=localhost dqm=['localhost']                                                                                                
           Will start a RU process on localhost reading card number 0, 2 links active                              daqconf_multiru_gen:176
[16:17:13] Generating system init command                                                                                conf_utils.py:752
           Generating system conf command                                                                                conf_utils.py:752
           Generating boot json file                                                                                     conf_utils.py:766
           Using a development area                                                                                      conf_utils.py:822
─────────────────────────────────────────────────────────── JSON file creation ───────────────────────────────────────────────────────────
           System configuration generated in directory 'mdapp_wib2_1x2_swtpg_tpsw'                                       conf_utils.py:797
[16:17:13] MDAapp config generated in mdapp_wib2_1x2_swtpg_tpsw                                                    daqconf_multiru_gen:677
[16:17:13] Generating metadata file                                                                                         metadata.py:10
(dbt) [biery@mu2edaq13 rundir]$ tmprun=401; runduration=30; waitAfterStop=2; local_backup log_*; nanorc mdapp_wib2_1x2_swtpg_tpsw/ ${USER}-test boot conf start_run ${tmprun} wait ${runduration} stop_run wait ${waitAfterStop} scrap terminate

jcfreeman2 commented 1 year ago

I've been recreating Kurt's run a few times in three different areas:

1) An N22-11-26-based area like Kurt was using, where hdf5 is built without threadsafe
2) An area based on the latest nightly, N22-12-01, also where hdf5 is built without threadsafe
3) An area based on the candidate release rc-v3.2.1-2, where hdf5 was built with threadsafe

First, the good news: in several runs in (1) and (2), one or both dataflow apps crash during running with some hdf5-related problem. In (3), I haven't seen this. For (1) and (2) you can sometimes even see the lack of thread safety directly in the mangled, interleaved error output (scroll to the bottom for an example).

Having said that, there are errors which appear in all three areas and may therefore be unrelated to thread safety. E.g., with the wib2-frames.bin file used, I reliably get the following message from the readout process:

2022-Dec-01 14:54:29,633 WARNING [void dunedaq::readoutlibs::FileSourceBuffer::read(const std::string&) at /cvmfs/dunedaq-development.opensciencegrid.org/nightly/N22-12-01/spack-0.18.1-gcc-12.1.0/spack-0.18.1/opt/spack/gcc-12.1.0/readoutlibs-N22-12-01-vwbib35emv4bvrv4ctuoaobyonoiqfa4/include/readoutlibs/utils/FileSourceBuffer.hpp:73] Configuration Error: Binary file contains more data than expected, filesize is 56160, chunk_size is 5664, filename is ./wib2-frames.bin

Also, some fraction of the time (30%?) the configure transition fails and the trigger process prints a message like "Offline TPC Channel Number out of range".

An example of a crash with a non-threadsafe hdf5 build, from mu2edaq13:/home/jcfree/daqbuild_N22-12-01/RunConf_406/log_dataflow1_3338.txt:

HDF5-DHDF5-DIAG: ErIAG: Error detected in rorHDF5 ( detected in 1.12.0) HDF5 (1.12.0) thread 0:
threa  #d 0:000:
/tmp/root/spack-stage/spack-stage-hdf5-1.12.0-rimexyeb4kqauyrgjofwt5wphckukcca/spack-src/src/H5VLnative_group.c   #000: /tmp/root/spack-stage/spack-stage-hdf5-1.12.0-rimexyeb4kqauyrgjofwt5wphckukcca/spack-src/src/H5VLnative_group.c liline ne 7474 in H5VL__native_group_create( in H5VL__native_group_create(): )unable to create group
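
The interleaved characters above are the signature of two threads sharing one error stream with nothing serializing whole messages. A minimal sketch of the same effect (hypothetical code, not taken from hdf5):

```cpp
// interleave_sketch.cpp -- shows how two threads sharing one stderr
// produce output like the log above when nothing serializes their writes.
#include <iostream>
#include <string>
#include <thread>

// Hypothetical stand-in for an error-stack printer: it emits its message
// one character at a time, with no lock held around the whole message.
void report_error(const std::string& msg) {
  for (char c : msg)
    std::cerr << c;  // each char lands separately, so messages can interleave
}

int main() {
  std::thread t1(report_error, "HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:\n");
  std::thread t2(report_error, "  #000: H5VL__native_group_create(): unable to create group\n");
  t1.join();
  t2.join();
  return 0;
}
```
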
jcfreeman2 commented 1 year ago

tl;dr: I can't get the dataflow app to crash while writing to two separate paths as long as it uses the rc-v3.2.1-2 release, so I think the +threadsafe build of hdf5 that rc-v3.2.1-2 uses is the solution to this Issue.

I created a 5 Hz version of the 1 Hz configuration which crashed runs 17971 and 17973 during integration week. As you'll recall from earlier in this Issue, I couldn't recreate the crashes at 1 Hz, but after increasing the rate to 5 Hz I can reliably get a dataflow crash shortly after start when running from DAQ_NP04_HD_DEV_AREA.

This all changes when I use a work area based on the rc-v3.2.1-2 candidate release (specifically, /nfs/sw/work_dirs/jcfree/TwoOutputDisk_Studies/daqbuild_rc-v3.2.1-2 on np04). In both runs 18177 and 18178 a single dataflow app wrote simultaneously to /data1 and /data2 without incident. At this point I'm willing to say that the +threadsafe build of hdf5 fixes our problem.
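
As an aside, if you want to confirm which flavor of hdf5 a work area actually linked against, the library can report its build-time thread-safety flag at run time; a minimal sketch (the file name and build command are assumptions; H5is_library_threadsafe is part of the hdf5 C API):

```cpp
// check_ts.cpp -- prints whether the linked libhdf5 was built with +threadsafe.
// Assumed build command (adjust to your area): g++ check_ts.cpp -lhdf5
#include <hdf5.h>
#include <cstdio>

int main() {
  hbool_t threadsafe = 0;
  // Queries the build-time thread-safety flag of the linked library.
  if (H5is_library_threadsafe(&threadsafe) < 0) {
    std::fprintf(stderr, "query failed\n");
    return 1;
  }
  std::printf("hdf5 threadsafe: %s\n", threadsafe ? "yes" : "no");
  return 0;
}
```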

dingp commented 1 year ago

Thanks John! I'll have the nightlies use the +threadsafe variant of hdf5. The earliest nightly with it will be N22-12-03.