Sounds like https://github.com/DUNE-DAQ/hdf5libs/issues/45. I haven't been able to recreate the crash (I've successfully written to two disks on np04-srv-001, like when Eric ran into the problem), but I haven't yet pushed the performance.
One thing that may be an issue: even if a single dataflow process writing data to multiple distinct files should logically be thread safe, this may not be something which can be done safely in hdf5 (see https://www.youtube.com/watch?v=ydQNFt-NSvg, starting at 20:03). The good news is that there's an option to build hdf5 in a thread-safe manner; the bad news is that we're not currently using it:
hdf5@1.12.0%gcc@12.1.0~cxx~fortran~hl~ipo~java~mpi+shared~szip~threadsafe+tools api=default build_type=RelWithDebInfo arch=linux-scientific7-broadwell
is the Spack spec of hdf5 when I set up our environment (note the ~threadsafe variant).
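To make the failure mode concrete: the pattern in question is one process in which two threads each write to their own, completely separate HDF5 file. A minimal sketch of that pattern using HighFive (hypothetical paths and group/dataset names, not the actual DataWriter code) looks like this:

```cpp
// Minimal sketch, NOT the actual dataflow/DataWriter code: two threads,
// each writing to its own independent HDF5 file via HighFive. With a
// ~threadsafe hdf5 build there is no internal locking, so even this
// "logically thread safe" pattern can race on the library's shared
// global state (ID tables, error stacks, metadata caches).
#include <highfive/H5File.hpp>
#include <string>
#include <thread>
#include <vector>

void write_one_file(const std::string& path) {
  HighFive::File file(path, HighFive::File::Truncate);
  std::vector<int> data(1000, 42);
  file.createGroup("TriggerRecord00001");  // illustrative names only
  file.createDataSet<int>("TriggerRecord00001/RawData",
                          HighFive::DataSpace::From(data))
      .write(data);
}

int main() {
  std::thread t1(write_one_file, "/data1/test_a.h5");  // placeholder paths
  std::thread t2(write_one_file, "/data2/test_b.h5");
  t1.join();
  t2.join();
  return 0;
}
```

With the ~threadsafe spec above, a program like this can crash intermittently; with +threadsafe, hdf5 serializes the calls internally and no application-side locking is needed.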
I'm going to re-test now that @dingp has built the highfive library in thread-safe mode.
Just tried recreating runs 17971 and 17973 using /nfs/home/np04daq/DAQ_NP04_HD_DEV_AREA/configurations/np04_daq_TwoOutputDisk_TPC_conf. Note that I had to regenerate this configuration since it was deleted at some point in the last two weeks, but I used the same candidate release as was used during integration week, rc-v3.2.1-1.

This time, the writing went fine; runs 18129, 18130, and 18131 all successfully wrote two separate hdf5 files with a single dataflow app on np04-srv-001, to /data1 and /data2.

Since "absence of evidence doesn't imply evidence of absence," this doesn't necessarily mean the problem is solved. We could run the system harder (e.g., two dataflow apps writing out four hdf5 files) to try to create a crash. If and when we generate a crash, we can try out rc-v3.2.1-2 and see if that fixes the problem.
I was able to see DF App crashes on mu2edaq13 with the following files and commands...
(dbt) [biery@mu2edaq13 rundir]$ cat hw_map.txt
# DRO_SourceID DetLink DetSlot DetCrate DetID DRO_Host DRO_Card DRO_SLR DRO_Link
0 0 0 1 3 localhost 0 0 0
1 1 0 1 3 localhost 0 0 1
(dbt) [biery@mu2edaq13 rundir]$
(dbt) [biery@mu2edaq13 rundir]$ cat daqconf_wib2_2df_swtpg_tpsw.json
{ "dataflow": { "apps": [ { "app_name": "dataflow0", "output_paths": [".", "."] }, { "app_name": "dataflow1", "output_paths": [".", "."] } ] }, "readout": { "data_file": "./wib2-frames.bin", "clock_speed_hz": 62500000, "enable_software_tpg": true, "data_rate_slowdown_factor": 10, "readout_sends_tp_fragments": false }, "trigger": { "enable_tpset_writing": true, "trigger_activity_config": {"prescale":1000}, "trigger_window_before_ticks": 10000, "trigger_window_after_ticks": 10000, "trigger_rate_hz": 11.0 } }
(dbt) [biery@mu2edaq13 rundir]$
(dbt) [biery@mu2edaq13 rundir]$ daqconf_multiru_gen -c ./daqconf_wib2_2df_swtpg_tpsw.json --hardware-map-file ./hw_map.txt mdapp_wib2_1x2_swtpg_tpsw
[16:16:57] Parsing config json file ./daqconf_wib2_2df_swtpg_tpsw.json config_file.py:41
[16:16:58] Loading dataflow config generator daqconf_multiru_gen:96
[16:17:01] Loading readout config generator daqconf_multiru_gen:101
[16:17:04] Loading trigger config generator daqconf_multiru_gen:103
[16:17:06] Loading DFO config generator daqconf_multiru_gen:105
Loading hsi config generator daqconf_multiru_gen:107
[16:17:08] Loading fake hsi config generator daqconf_multiru_gen:109
[16:17:09] Loading timing partition controller config generator daqconf_multiru_gen:111
[16:17:10] Loading DPDK sender config generator daqconf_multiru_gen:113
Loading TPWriter config generator daqconf_multiru_gen:116
[16:17:12] Parsing dataflow app config {'app_name': 'dataflow0', 'token_count': 10, 'output_paths': ['.', '.'], daqconf_multiru_gen:130
'host_df': 'localhost', 'max_file_size': 4294967296, 'data_store_mode': 'all-per-file',
'max_trigger_record_window': 0}
Parsing dataflow app config {'app_name': 'dataflow1', 'token_count': 10, 'output_paths': ['.', '.'], daqconf_multiru_gen:130
'host_df': 'localhost', 'max_file_size': 4294967296, 'data_store_mode': 'all-per-file',
'max_trigger_record_window': 0}
Generating configs for hosts trigger=localhost DFO=localhost dataflow=['localhost', 'localhost'] daqconf_multiru_gen:158
hsi=localhost dqm=['localhost']
Will start a RU process on localhost reading card number 0, 2 links active daqconf_multiru_gen:176
[16:17:13] Generating system init command conf_utils.py:752
Generating system conf command conf_utils.py:752
Generating boot json file conf_utils.py:766
Using a development area conf_utils.py:822
─────────────────────────────────────────────────────────── JSON file creation ───────────────────────────────────────────────────────────
System configuration generated in directory 'mdapp_wib2_1x2_swtpg_tpsw' conf_utils.py:797
[16:17:13] MDAapp config generated in mdapp_wib2_1x2_swtpg_tpsw daqconf_multiru_gen:677
[16:17:13] Generating metadata file metadata.py:10
(dbt) [biery@mu2edaq13 rundir]$ tmprun=401; runduration=30; waitAfterStop=2; local_backup log_*; nanorc mdapp_wib2_1x2_swtpg_tpsw/ ${USER}-test boot conf start_run ${tmprun} wait ${runduration} stop_run wait ${waitAfterStop} scrap terminate
I've been recreating Kurt's run a few times in three different areas:
1) An N22-11-26-based area like Kurt was using, where hdf5 is built without threadsafe
2) An area based on the latest nightly, N22-12-01, also where hdf5 is built without threadsafe
3) An area based on the candidate release rc-v3.2.1-2, where hdf5 was built _with_ threadsafe (illustrative spec below)
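For comparison with the spec quoted earlier in this Issue, an area-(3) setup should show +threadsafe where the old spec had ~threadsafe. Illustratively (derived from that earlier spec, not pasted from the actual rc-v3.2.1-2 area, whose compiler/arch details may differ):

hdf5@1.12.0%gcc@12.1.0~cxx~fortran~hl~ipo~java~mpi+shared~szip+threadsafe+tools api=default build_type=RelWithDebInfo arch=linux-scientific7-broadwell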
First, the good news: for several runs in (1) and (2), one or both dataflow apps crashed during running with some hdf5-related problem. In (3), I haven't seen this. For (1) and (2) you can sometimes even see the lack of thread safety directly in the garbled error output (scroll to the bottom for an example).
Having said that, there are errors which appear in all three areas and which may be unrelated to thread safety. E.g., with the wib2-frames.bin file used, I reliably get the following message from the readout process:
2022-Dec-01 14:54:29,633 WARNING [void dunedaq::readoutlibs::FileSourceBuffer::read(const std::string&) at /cvmfs/dunedaq-development.opensciencegrid.org/nightly/N22-12-01/spack-0.18.1-gcc-12.1.0/spack-0.18.1/opt/spack/gcc-12.1.0/readoutlibs-N22-12-01-vwbib35emv4bvrv4ctuoaobyonoiqfa4/include/readoutlibs/utils/FileSourceBuffer.hpp:73] Configuration Error: Binary file contains more data than expected, filesize is 56160, chunk_size is 5664, filename is ./wib2-frames.bin
(Note that the 56160-byte file is not an integer multiple of the 5664-byte chunk size: 9 chunks would be 50976 bytes and 10 would be 56640, which is presumably what triggers the warning.)

Also, some fraction of the time (30%?) the configure transition fails and the trigger process prints a message like:

Offline TPC Channel Number out of range
An example of a crash with a non-threadsafe hdf5 build, from mu2edaq13:/home/jcfree/daqbuild_N22-12-01/RunConf_406/log_dataflow1_3338.txt:
HDF5-DHDF5-DIAG: ErIAG: Error detected in rorHDF5 ( detected in 1.12.0) HDF5 (1.12.0) thread 0:
threa #d 0:000:
/tmp/root/spack-stage/spack-stage-hdf5-1.12.0-rimexyeb4kqauyrgjofwt5wphckukcca/spack-src/src/H5VLnative_group.c #000: /tmp/root/spack-stage/spack-stage-hdf5-1.12.0-rimexyeb4kqauyrgjofwt5wphckukcca/spack-src/src/H5VLnative_group.c liline ne 7474 in H5VL__native_group_create( in H5VL__native_group_create(): )unable to create group
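(The character-level interleaving in that excerpt is two threads printing their HDF5 error stacks at the same time, i.e., direct evidence that both threads were inside the library concurrently.) Conceptually, what the +threadsafe build adds is a global lock around hdf5's entry points. An application-level equivalent, shown here only as a sketch of the idea (hypothetical names; not something we did, and unnecessary with the fixed build), would be to serialize every hdf5 call behind one mutex:

```cpp
// Sketch only: serialize all HDF5 access behind one process-wide mutex.
// The +threadsafe hdf5 build does effectively this with an internal
// global lock, which is why it fixes the crashes without code changes.
#include <highfive/H5File.hpp>
#include <mutex>
#include <string>

std::mutex g_hdf5_mutex;  // hypothetical guard for ALL hdf5 calls

void create_group_locked(HighFive::File& file, const std::string& name) {
  std::lock_guard<std::mutex> lock(g_hdf5_mutex);
  file.createGroup(name);
}
```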
tl;dr: I can't get the dataflow app to crash while writing to two separate paths as long as it uses the rc-v3.2.1-2 release, so I think that the +threadsafe build of hdf5 that rc-v3.2.1-2 uses is the solution to this Issue.
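As an aside, one can confirm which flavor of hdf5 a given work area links against using the standard C API call H5is_library_threadsafe(). A minimal standalone check (my sketch, not part of any release) would be:

```cpp
// Report whether the linked hdf5 library was built with thread safety.
#include <hdf5.h>
#include <cstdio>

int main() {
  hbool_t is_threadsafe = 0;
  if (H5is_library_threadsafe(&is_threadsafe) < 0) {
    std::fprintf(stderr, "H5is_library_threadsafe() failed\n");
    return 2;
  }
  std::printf("hdf5 threadsafe: %s\n", is_threadsafe ? "yes" : "no");
  return is_threadsafe ? 0 : 1;
}
```

Compile and link this against the same libhdf5 as the work area; a +threadsafe build should print "yes".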
I created a 5 Hz version of the 1 Hz configuration which crashed runs 17971 and 17973 during integration week. As you'll recall from earlier in this Issue, I couldn't recreate the crashes at 1 Hz, but after increasing the rate to 5 Hz I can reliably get a dataflow crash almost immediately after start when running from DAQ_NP04_HD_DEV_AREA.
This all changes when I use a work area based on the rc-v3.2.1-2 candidate release (specifically, /nfs/sw/work_dirs/jcfree/TwoOutputDisk_Studies/daqbuild_rc-v3.2.1-2 on np04). In both runs 18177 and 18178 I've been able to have a dataflow app simultaneously write to both /data1 and /data2. At this point I'm willing to say that the +threadsafe build of hdf5 fixes our problem.
Thanks John! I'll have the nightlies use the +threadsafe variant of hdf5. The earliest one with it will be N22-12-03.
The dataflow app often crashes during running when configured with multiple data writers and when the performance is pushed. Most probably this is caused by a race condition in one of the libraries (a non-thread-safe one?) used by the data writer module.