eth-cscs / pyfr

pyfr@cscs (https://github.com/vincentlab/PyFR)

Placement of job on nodes/gpus #11

Open jgphpc opened 8 years ago

jgphpc commented 8 years ago

@pmessmer here is an example nodelist for 2 jobs that failed with "cuStreamSynchronize failed: unknown_error" (aka the "fallen off the bus" error):

xtprocadmin should give the position of each node in the system, for instance for nid05373:

5373   0x14fd c7-2c2s15n1

Is that enough for you to get started?
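For reference, a minimal sketch of how the xtprocadmin line above could be decoded. It assumes the usual Cray component-name layout c<cabinet column>-<cabinet row>c<cage>s<slot>n<node>, and that the first three columns are the decimal nid, the hex nid and the node name, as in the line quoted above. The function names are made up for illustration and are not part of any Cray tool.

```python
import re

# Assumed Cray component-name layout:
#   c<cabinet column>-<cabinet row>c<cage>s<slot>n<node>
CNAME_RE = re.compile(r"c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)")


def parse_cname(cname):
    """Split a node name such as 'c7-2c2s15n1' into its position fields."""
    match = CNAME_RE.fullmatch(cname.strip())
    if match is None:
        raise ValueError(f"not a recognised node name: {cname!r}")
    col, row, cage, slot, node = (int(g) for g in match.groups())
    return {"cabinet_x": col, "cabinet_y": row,
            "cage": cage, "slot": slot, "node": node}


def parse_xtprocadmin_line(line):
    """Parse 'nid hex cname ...' (hypothetical helper; extra columns ignored)."""
    fields = line.split()
    nid = int(fields[0])
    # The second column is the same nid in hexadecimal; use it as a sanity check.
    if int(fields[1], 16) != nid:
        raise ValueError(f"decimal and hex nid disagree in: {line!r}")
    return nid, parse_cname(fields[2])


nid, pos = parse_xtprocadmin_line("5373   0x14fd c7-2c2s15n1")
print(nid, pos)
```

Under that assumption, nid05373 sits in cabinet column 7, row 2, cage 2, slot 15, node 1.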

pmessmer commented 8 years ago

Attached are two images showing the placement of the two runs on the machine, organized by cabinet X/Y, cage, slot and CPU:

Obviously far too little data for statistics, but in both cases the failure was in the middle cabinet row, in the middle cage and in one of the upper slots. Also interesting to see how the scheduler scatters the app across the machine ☺

The script can be run as follows (it requires DaintTopo.txt):

~messmerp/projects/crayvis/crayvis.py nodes.txt image.png failureNode1 failureNode2 …

where nodes.txt is the node file of the run, image.png is the output file, and failureNode1 etc. are the failing nodes, displayed in red. This is still in its infancy, but could be useful for quickly looking at placement statistics.
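As a text-only complement to the images, here is a rough sketch of the kind of tally one could build from the same inputs. It is not crayvis.py itself: it assumes the job's nodes are already available as Cray node names (c<col>-<row>c<cage>s<slot>n<node>), and all node names in the example except c7-2c2s15n1 are made up. placement_summary is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed node-name layout: c<cabinet col>-<cabinet row>c<cage>s<slot>n<node>
CNAME_RE = re.compile(r"c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)")


def position(cname):
    """Return (cabinet_x, cabinet_y, cage, slot, node) for a node name."""
    match = CNAME_RE.fullmatch(cname.strip())
    if match is None:
        raise ValueError(f"not a recognised node name: {cname!r}")
    return tuple(int(g) for g in match.groups())


def placement_summary(job_nodes, failed_nodes):
    """Count job nodes per (cabinet row, cage) and list failure positions."""
    per_cage = Counter()
    for cname in job_nodes:
        _x, y, cage, _slot, _node = position(cname)
        per_cage[(y, cage)] += 1
    failures = sorted(position(c) for c in set(failed_nodes))
    return per_cage, failures


# Illustrative input only: these are not the node lists of the failed jobs.
job_nodes = ["c0-0c0s1n0", "c3-1c1s12n2", "c7-2c2s15n1", "c7-2c2s15n2"]
per_cage, failures = placement_summary(job_nodes, ["c7-2c2s15n1"])

for (row, cage), count in sorted(per_cage.items()):
    print(f"cabinet row {row}, cage {cage}: {count} node(s)")
print("failures at (x, y, cage, slot, node):", failures)
```

With more failed runs, the same counters could be accumulated across jobs to check whether failures really cluster in particular rows, cages or slots.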

jgphpc commented 8 years ago

job498868

im498868

job464938

im464938