Open jgphpc opened 8 years ago
Attached two images for the placement of the two runs on the machine, organized by cabinet X,Y, cages, slots and cpu:
Obviously far too little data for statistics, but in both cases the failure was in the middle cabinet row, in the middle cage and in one of the upper slots. Also interesting to see how the scheduler scatters the app across the machine ☺
The script can be run as follows (it requires DaintTopo.txt):
~messmerp/projects/crayvis/crayvis.py nodes.txt image.png failureNode1 failureNode2 …
where nodes.txt is the node file of the run, image.png is the output file, and failureNode1 etc. are the failing nodes, which are displayed in red. This is still in its infancy, but could be useful for quickly looking at the placement statistics.
@pmessmer here is an example of the nodelist for 2 jobs that failed with "cuStreamSynchronize failed: unknown_error" (aka the "fallen off the bus" error):
xtprocadmin
should give the position of each node in the system, for instance for nid05373. Is it enough for you to start?
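For reference, a minimal Python sketch of turning a node's component name into the cabinet X,Y, cage, slot and node indices discussed above. It assumes the usual Cray naming scheme cX-YcCsSnN as it appears in the NODENAME column of xtprocadmin; the helper names parse_cname and parse_nid and the sample values are hypothetical and are not taken from crayvis.py:

```python
import re
from typing import NamedTuple


class NodePosition(NamedTuple):
    cabinet_x: int  # cabinet column
    cabinet_y: int  # cabinet row
    cage: int       # cage (chassis) within the cabinet
    slot: int       # blade slot within the cage
    node: int       # node on the blade


# Assumed cname layout: c<X>-<Y>c<cage>s<slot>n<node>, e.g. c3-1c2s7n0
_CNAME_RE = re.compile(r"^c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)$")


def parse_cname(cname: str) -> NodePosition:
    """Split a node cname into its numeric position fields."""
    match = _CNAME_RE.match(cname.strip())
    if match is None:
        raise ValueError(f"not a node cname: {cname!r}")
    return NodePosition(*(int(g) for g in match.groups()))


def parse_nid(name: str) -> int:
    """Convert a hostname such as 'nid05373' to its integer node ID."""
    if not name.startswith("nid"):
        raise ValueError(f"not a nid hostname: {name!r}")
    return int(name[3:])


if __name__ == "__main__":
    # Made-up cname, used only to illustrate the field order.
    print(parse_cname("c3-1c2s7n0"))
    print(parse_nid("nid05373"))
```

The mapping from a nid to its cname still has to come from xtprocadmin (or a topology file such as DaintTopo.txt); this sketch only decodes the name once it is known.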