art-daq / artdaq_daqinterface


It should be possible to recover details about why a process didn't launch #60

Closed eflumerf closed 2 years ago

eflumerf commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/21739 (FNAL account required) Originally created by @jcfreeman2 on 2019-01-22 23:04:31


Right now, in direct process management, if a process (or set of processes) doesn't launch on a node, it can be difficult to determine the cause. To avoid a verbose spew to stdout that would overwhelm everything else (especially when a large number of processes are requested), the output from sourcing the setup script and from launching the artdaq processes is suppressed unless we're at the highest debug level (currently 4). The downside is that if sourcing the setup script returns nonzero, or an artdaq process doesn't launch, the reason is suppressed as well. Two real-world examples of this include the sourcing of a setupARTDAQDEMO script returning nonzero for the following reason:

/home/jcfree/artdaq-demo_dec19/setupARTDAQDEMO: line 61: tonMg: command not found

and none of the processes launching on mu2edaq11:

boardreader: error while loading shared libraries: libsqlite3_ups.so.0: cannot open shared object file: No such file or directory

where it's important to note that the latter error does not make it into the MessageFacility logfile - in fact, no MessageFacility logfile gets created at all for the unlaunched process.

A (temporary) record of what happened if something goes wrong with a process launch should be saved, and in the event that something goes wrong, users should be pointed to it. If processes launch without a problem, the record should be deleted.
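As a rough illustration of the proposal above, here is a minimal Python sketch. It assumes the launch can be run as one shell command whose exit status reflects success. The function name `launch_with_record` and the record file naming are hypothetical and not taken from DAQInterface. Stdout and stderr are both captured in a temporary file, which is deleted on success and kept on failure so the user can be pointed to it.

```python
import os
import subprocess
import tempfile


def launch_with_record(cmd, record_dir=None):
    """Run cmd in a shell, saving stdout and stderr to a temporary record.

    Returns (returncode, record_path). record_path is None when the command
    succeeded, because the record is deleted in that case.
    Hypothetical helper for illustration only.
    """
    # Create the record file up front so that even an immediate failure
    # (e.g. a missing shared library) leaves something to inspect.
    fd, record_path = tempfile.mkstemp(
        prefix="launchattempt_", suffix=".log", dir=record_dir
    )
    with os.fdopen(fd, "w") as record:
        # stderr is merged into stdout so that shell and loader errors,
        # which never reach a MessageFacility logfile, are captured too.
        result = subprocess.run(
            cmd, shell=True, stdout=record, stderr=subprocess.STDOUT
        )

    if result.returncode == 0:
        # Nothing went wrong: the temporary record is no longer needed.
        os.remove(record_path)
        return 0, None

    # Something went wrong: keep the record and hand its path to the caller.
    return result.returncode, record_path
```

A caller could then print something like "launch failed, see <record_path>" only when the returned path is not None, keeping stdout quiet in the normal case.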

eflumerf commented 2 years ago

Comment by @jcfreeman2 on 2019-01-23 22:21:51


With commit 962ff8fba8d1e8553e69b72c963ec619f42d1fec on the develop branch, when processes are launched in direct process management mode, the output on each host is saved in a file called :/launchattempt_partition. If the processes don't all launch successfully, the user is pointed to the output file for each host where the launch failed. Note that this file contains not just MessageFacility output but also stdout and stderr, and hence captures the examples given above. For MessageFacility console output we'll want to make the threshold strict (say, errors only, not warnings or info) so there isn't a performance hit from output being directed to these files.
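The per-host reporting described here might look roughly like the following sketch. It is not the code from the commit: the function name, the dictionary layout, and the example file path are all hypothetical. It only shows the idea of comparing what was expected on each host with what actually came up, and pointing the user to that host's output file when something is missing.

```python
def failed_launch_messages(expected, launched, record_paths):
    """Build one message per host where not every process came up.

    expected:     dict mapping host -> set of process labels to launch
    launched:     dict mapping host -> set of labels confirmed running
    record_paths: dict mapping host -> path of the launch output file
    All three structures are hypothetical, for illustration only.
    """
    messages = []
    for host in sorted(expected):
        # Labels that were requested on this host but never came up.
        missing = sorted(expected[host] - launched.get(host, set()))
        if missing:
            messages.append(
                "{0}: failed to launch {1}; see {0}:{2}".format(
                    host, ", ".join(missing), record_paths[host]
                )
            )
    return messages


# Example with made-up labels and a made-up file path.
for line in failed_launch_messages(
    expected={"mu2edaq11": {"boardreader1"}, "mu2edaq12": {"eventbuilder1"}},
    launched={"mu2edaq12": {"eventbuilder1"}},
    record_paths={
        "mu2edaq11": "/tmp/launchattempt_example.log",
        "mu2edaq12": "/tmp/launchattempt_example.log",
    },
):
    print(line)
```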