asmacdo closed this pull request 2 months ago.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 94.97%. Comparing base (2b12679) to head (08dcaeb). Report is 24 commits behind head on main.
:umbrella: View full report in Codecov by Sentry.
On typhon with singularity (with and without containers run), prior to this PR we got divergent results that clearly showed that duct sample aggregation was not working as expected.
I've rerun those experiments, and the two ways of using singularity with duct now return very similar outputs:
"execution_summary": {
"exit_code": 0,
"command": "singularity run --contain --bind /home/asmacdo/devel/sandbox/mriqc-sanity/sourcedata:/data:ro --bind /home/asmacdo/devel/sandbox/mriqc-sanity/dfix-sr-1:/out --bind /home/asmacdo/devel/sandbox/mriqc-sanity/dfix-sr-1/workdir:/workdir code/containers/images/bids/bids-mriqc--0.16.0.sing /data /out participant --participant-label 02 -w /workdir --no-sub",
"logs_prefix": "dfix-sr-1/duct_",
"wall_clock_time": 381.45661306381226,
"peak_rss": 8019558400,
"average_rss": 6542514699.13043,
"peak_vsz": 23663247360,
"average_vsz": 20365962395.826088,
"peak_pmem": 0,
"average_pmem": 0,
"peak_pcpu": 3438.4,
"average_pcpu": 332.87663043478267,
"num_samples": 368,
"num_reports": 8
},
"execution_summary": {
"exit_code": 0,
"command": "./code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-mriqc--0.16.0.sing sourcedata dfix-cr-1 participant --participant-label 02 -w dfix-cr-1/workdir --no-sub",
"logs_prefix": "dfix-cr-1/duct_",
"wall_clock_time": 377.5985689163208,
"peak_rss": 8124411904,
"average_rss": 6525882558.772606,
"peak_vsz": 23824408576,
"average_vsz": 20357140132.120544,
"peak_pmem": 0,
"average_pmem": 0,
"peak_pcpu": 3035.2,
"average_pcpu": 335.2660273972601,
"num_samples": 365,
"num_reports": 8
},
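To quantify "very similar", here is a quick check of the headline numbers copied from the two dumps above (the script itself is just illustrative arithmetic, not part of duct):

```python
# Relative difference between the two runs, values taken from the dumps above.
runs = {
    "wall_clock_time": (381.45661306381226, 377.5985689163208),
    "peak_rss": (8019558400, 8124411904),
    "average_rss": (6542514699.13043, 6525882558.772606),
    "peak_pcpu": (3438.4, 3035.2),
}
for name, (a, b) in runs.items():
    print(f"{name}: {abs(a - b) / max(a, b):.1%} apart")
# wall_clock_time: 1.0% apart
# peak_rss: 1.3% apart
# average_rss: 0.3% apart
# peak_pcpu: 11.7% apart
```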
Somewhat unrelated, I guess: `"peak_pcpu": 3438.4` looks odd, even for typhon with 32 cores (32 × 100% = 3200%). Having a plot would have been useful to grasp the situation there.
Using the PoC https://github.com/asmacdo/fancy-duct (2 lines had to change, fixed upstream).
The case I can think of that could artificially inflate the peak would be if the processes were trading resources back and forth, but the aggregation was totaling the peaks of both. So I added a test for that case, which passes; the scenario is sketched below.
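A minimal sketch of that scenario (illustrative data, not duct's actual API or test): two processes trade CPU between samples, and a correct peak takes the max of per-sample totals, never the sum of per-process peaks.

```python
# Each sample maps pid -> pcpu at one `ps` call (hypothetical data).
samples = [
    {"p1": 90.0, "p2": 10.0},  # sample 1: p1 busy, p2 mostly idle
    {"p1": 10.0, "p2": 90.0},  # sample 2: resources traded
]

# Correct peak: the maximum over samples of the per-sample total.
peak_total = max(sum(s.values()) for s in samples)

# Buggy peak: totaling each process's individual peak across samples.
inflated = sum(max(s[pid] for s in samples) for pid in samples[0])

assert peak_total == 100.0
assert inflated == 180.0  # the artificial inflation the test guards against
```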
IIUC (I still could be wrong), this means that the 3438% spike was actually recorded by a single call to `ps`, and the individual processes somehow added up to more than numcores*100.
FWIW the record of the spike:
And indeed, if I go through and add the peaks of each pid as listed above, we get an even higher number:
```python
In [25]: tot = 0
    ...: for pid, stats in data['processes'].items():
    ...:     for k, v in stats.items():
    ...:         if k == 'pcpu':
    ...:             tot += v
    ...:

In [26]: tot
Out[26]: 4000.2999999999993
```
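(Equivalently, assuming every per-pid stats dict carries a `pcpu` key, the same total in one line:)

```python
tot = sum(stats['pcpu'] for stats in data['processes'].values())
```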
Please wrap long dumps into `<details>` next time. So was that on typhon? If so, let's consider it OK and proceed.
Yes, on typhon.
This began as work to add `stat`, which as a dumb initial implementation was just a list of all the values we get (to be replaced by a counter). However, the sample aggregation returned strange results: the averages reported the correct number of samples, but `stat` was only showing a list of length 1.

So this PR refactors the complexity of sample aggregation to be abstracted by a Report function, `update_from_sample`. This will be much more testable.

- DONE: test `update_from_sample`
- ~~DONE: since `aggregate` (was `max`) is now updating the averages, we can drop `Report.averages`, since that information is duplicated by `report.max_values.averages`, which should be renamed to `report.full_execution_stats` (I don't like that name) or similar.~~

At the end of the run: I've removed the `stat` update, to add separately in another PR.
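For readers following along, a minimal sketch of the `update_from_sample` shape described above (the field names and sample layout here are assumptions for illustration, not duct's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Report:
    # Hypothetical fields: only pcpu aggregation is sketched here.
    peak_pcpu: float = 0.0
    sum_pcpu: float = 0.0
    num_samples: int = 0

    def update_from_sample(self, sample: dict) -> None:
        """Fold one `ps` sample (pid -> stats dict) into the running totals.

        The peak is the max of per-sample totals, never the sum of
        per-process peaks, so processes trading CPU back and forth
        cannot artificially inflate it.
        """
        total = sum(stats["pcpu"] for stats in sample.values())
        self.peak_pcpu = max(self.peak_pcpu, total)
        self.sum_pcpu += total
        self.num_samples += 1

    @property
    def average_pcpu(self) -> float:
        return self.sum_pcpu / self.num_samples if self.num_samples else 0.0
```

Pulling the aggregation into a single method like this also makes the trading-resources case above a plain unit test: feed in two handcrafted samples and assert on `peak_pcpu`.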