Open mdzik opened 1 year ago
Thanks for sharing this info.
Ad 1: Can you share some performance data? E.g. how does it compare to NVIDIA?
Ad 2: I think for now compiling R is the only option. Thankfully I have not had a problem compiling R on any architecture yet.
Ad 3: It's interesting. Setting gpu_oversubscribe to true means TCLB sees only one card. Did you bind GPUs or set ROCM_VISIBLE_GPU? Can you share sbatch scripts?
Ad 4: I remember you were doing some ADIOS integration. How does ADIOS relate to for example Catalyst? And how does it relate to HDF5? Feel free to start another issue on this subject, as I think this doesn't relate much to HIP/LUMI
cat run.slurm
#!/bin/bash -l
. ~/tclb_env.sh
BIN="/users/XXXX/TCLB/CLB/auto_porous_media_d3q19_TRT_GinzburgEqOrd1_FlowInZ/main ./RunMe.xml"
srun /users/XXXX/select_gpu.sh $BIN
/users/XXX/select_gpu.sh:
#!/bin/bash
GPUSID="4 5 2 3 6 7 0 1"
GPUSID=(${GPUSID})
if [ ${#GPUSID[@]} -gt 0 ]; then
export ROCR_VISIBLE_DEVICES=${GPUSID[$((SLURM_LOCALID / ($SLURM_NTASKS_PER_NODE / ${#GPUSID[@]})))]}
fi
echo "Selected GPU: $ROCR_VISIBLE_DEVICES"
exec $*
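The index arithmetic in select_gpu.sh above can be checked without a Slurm allocation. The following is only a sketch: pick_gpu is a hypothetical helper (not part of TCLB or LUMI tooling) that takes the local rank and the tasks-per-node count as arguments instead of reading SLURM_LOCALID and SLURM_NTASKS_PER_NODE. It also adds a guard that the original script does not have, since with fewer than 8 tasks per node the integer division gives 0 and the original expression would divide by zero.

```shell
#!/bin/bash
# Hypothetical helper: same GPU order and index arithmetic as select_gpu.sh,
# but with the Slurm values passed in as arguments.
#   $1 = local rank on the node (stand-in for SLURM_LOCALID)
#   $2 = tasks per node         (stand-in for SLURM_NTASKS_PER_NODE)
pick_gpu() {
    local local_id=$1 tasks_per_node=$2
    local gpus=(4 5 2 3 6 7 0 1)
    # Integer division: number of ranks that share one GPU
    local ranks_per_gpu=$(( tasks_per_node / ${#gpus[@]} ))
    # Added guard (assumption, not in the original script):
    # avoid dividing by zero when there are fewer tasks than GPUs
    if (( ranks_per_gpu < 1 )); then ranks_per_gpu=1; fi
    echo "${gpus[$(( local_id / ranks_per_gpu ))]}"
}

for id in 0 1 7; do
    echo "8 tasks/node, local rank $id -> GPU $(pick_gpu "$id" 8)"
done
echo "16 tasks/node, local rank 3 -> GPU $(pick_gpu 3 16)"
```

With 8 tasks per node every rank gets its own device; with 16, consecutive pairs of ranks share one.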
Some side notes, I forgot:
Those should be turned off by default (or at configure?). It's 1000s of files/lines for big runs:
MPMD: TCLB: local:108/128 work:108/128 --- connected to:
.xml files saved in output directory per process
And maybe we could support data dumps to h5 format instead of lots of pri? Or default to DUMPNAME/slice.pri. Again, that's lots of files per folder, and it gets complicated to handle when doing restarted simulations. Most of the time you have ~2 days' worth.
There is also a numbering error in at least pri - we assumed that 99 GPUs is enough, and the leading zero is messy to handle in batch ;)
for d in {0..9}; do ln -s ./output/Restart-"$PREV"_Save_P0"$d"_00000000_"$d".pri restart_"$NEXT"_"$d".pri; done
for d in {10..1023}; do ln -s ./output/Restart-"$PREV"_Save_P"$d"_00000000_"$d".pri restart_"$NEXT"_"$d".pri; done
Ad printouts and xml: I agree it's cumbersome with many threads. There are a couple of prints to be cleaned up, and the xml files can be fixed to export just a single one.
Ad dumps: The "pri" files were just the simplest (and fastest) way of doing things. We can switch to h5, but I think this should be an option, as I don't want hdf5 to be a mandatory dependency. As for the file mess, we can come up with some good way to arrange them. You can use the keep parameter in SaveCheckpoint to set how many previous dumps you want to store. It's implemented in a safe way (it writes before deleting the old ones).
Ad numbering: it's true that it's designed to have two digits for the rank, but I don't know if changing it now is a good idea. I agree it's a bit messy, but you can use seq -f '%02.0f' 0 200 in bash to generate a sequence with two-digit formatting.
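As a sketch of how the two-digit padding could collapse the two linking loops into one: printf's %02d pads single-digit ranks and leaves longer ones unchanged, which matches the names used in the loops above (P07, P42, P512). dump_name is a hypothetical helper, and the naming pattern is assumed from those loops rather than taken from the TCLB source.

```shell
#!/bin/bash
# Hypothetical helper: build the dump file name for one rank, assuming the
# pattern Restart-<PREV>_Save_P<rank, at least 2 digits>_00000000_<rank>.pri
#   $1 = previous restart label (PREV in the loops above)
#   $2 = rank, without leading zeros
dump_name() {
    local prev=$1 rank=$2
    printf 'Restart-%s_Save_P%02d_00000000_%d.pri\n' "$prev" "$rank" "$rank"
}

dump_name 3 7
dump_name 3 42
dump_name 3 512

# One loop instead of two (PREV, NEXT and the rank range are placeholders):
# for d in $(seq 0 1023); do
#     ln -s "./output/$(dump_name "$PREV" "$d")" "restart_${NEXT}_${d}.pri"
# done
```

The rank must be passed without a leading zero, because printf would read a value such as 08 as an invalid octal number.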
As for performance, @mdzik could you run some tests to compare double to double performance between nvidia and AMD? The "theory" is that AMD cards are designed for double precision.
AFAIK I was mistaken - the binary was double precision
TCLB was part of the LUMI (https://lumi-supercomputer.eu/lumis-second-pilot-phase-in-full-swing/) Pilot Program, which is now ending.
Apart from performance results, there are some issues that might be worth considering. LUMI is a brand new CRAY/HPE computer with AMD Instinct MI250X 128GB HBM2e cards.
As for results, I ran a 0.8e9 lattice dissolution simulation for AGU :D That is around half of the 12 cm experimental core at 30 um resolution.
I still have a few days left - if you want to check something, we could do it.