formiel opened 1 month ago
FYI, my experiments are run on 32 GPUs MI250x and my run.slurm is as below:
#!/bin/bash
module purge && module load cpe/23.12 craype-accel-amd-gfx90a craype-x86-trento PrgEnv-gnu amd-mixed/6.0.0 cray-python/3.10.10 aws-ofi-rccl && module list
source ~/env/py310_rocm60_torch241_main/bin/activate
export PYTHONPATH=$PYTHONPATH:$HOME/code/fairspeech_py310_rocm60_torch241_main
export FAIRSEQ=$HOME/code/fairspeech_py310_rocm60_torch241_main
export ROCBLAS_INTERNAL_FP16_ALT_IMPL=1
export MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL=1
export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p ${MIOPEN_USER_DB_PATH}
export MIOPEN_DEBUG_AMD_ROCM_PRECOMPILED_BINARIES=0 # disable use of precompiled binaries for certain operations, including the Winograd convolution algorithm
export MIOPEN_DEBUG_CONV_FFT=0 # disable FFT convolution algorithm
export MIOPEN_DEBUG_CONV_DIRECT=0 # disable Direct convolution algorithm
# export MIOPEN_DEBUG_GCN_ASM_KERNELS=0 # disable hand-tuned asm. kernels for Direct convolution algorithm. Fall-back to kernels written in high-level language.
# echo the commands being run
set -x
NCPU_PER_PROC=$((NCPUS_PER_TASK / NGPUS_PER_NODE))
export OMP_NUM_THREADS=${NCPU_PER_PROC}
# export PYTHONPATH=$HOME/code/fairspeech # comment out to use load_torchxx.sh in Adastra
echo "Total ${NGPUs} GPUs requested, ${NGPUS_PER_NODE} GPUs per node"
MASTER_PORT=$(shuf -i 20000-65000 -n 1)
# export LOGLEVEL="INFO"
# eval "TMPDIR=$SCRATCH/tmp OMP_NUM_THREADS=${NCPUS_PER_TASK} srun ${COMMAND}"
eval 'srun \
--cpu-bind=none \
--mem-bind=none \
--label \
-- torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${NGPUS_PER_NODE}" \
--rdzv-id="${SLURM_JOBID}" \
--rdzv-backend=c10d \
--rdzv-endpoint="$(scontrol show hostname ${SLURM_NODELIST} | head -n 1):${MASTER_PORT}" \
--max-restarts="1" \
${COMMAND}'
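For comparison, a rough sketch of a persistent, experiment-keyed cache location that sequential jobs could share, instead of the per-job /tmp path (untested; EXPERIMENT_NAME is a placeholder and ${SCRATCH} must point to storage that survives across jobs):
# Hypothetical alternative (untested): key the MIOpen user cache on the
# experiment name instead of the job ID, and keep it on storage that
# persists across jobs. EXPERIMENT_NAME and SCRATCH are placeholders.
EXPERIMENT_NAME="my_experiment"
export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen-cache/${EXPERIMENT_NAME}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p "${MIOPEN_USER_DB_PATH}"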
Hi @formiel. An internal ticket has been created to assist with your issue. Thanks!
Hi @formiel, are you able to run your run.slurm, and can you post the results? According to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/perfdb.html, you can try the System PerfDb (which the documentation describes as "a system-wide storage that holds pre-run values for the most applicable configurations"); note that "User PerfDb always takes precedence over System PerfDb."
Hello @huanrwan-amd,
Thank you very much for your reply! I encountered an error when setting MIOPEN_USER_DB_PATH to a location on local disk in order to reuse the optimized kernels in subsequent runs. My colleague @etiennemlb suggested a solution: running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.
Due to time and resource constraints, I'm unable to try this solution at the moment, but I'll test it when possible and share the results with you later.
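Schematically, that two-phase approach would look something like the following (untested sketch; warmup.slurm and train.slurm are hypothetical scripts that both export the same persistent MIOPEN_USER_DB_PATH / MIOPEN_CUSTOM_CACHE_DIR before launching training):
# Untested sketch of the two-phase approach: a short single-GPU warm-up job
# populates the MIOpen cache, then the full multi-GPU run reuses it.
# warmup.slurm and train.slurm are hypothetical script names.
WARMUP_ID=$(sbatch --parsable warmup.slurm)
sbatch --dependency=afterok:${WARMUP_ID} train.slurm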
Hi @formiel, thank you for your response. I will close the ticket for now.
@huanrwan-amd Why close the issue? Isn't it a big issue if kernel caches cannot be used across sequential jobs?
Hi @netw0rkf10w, this ticket is to address a specific issue for the originator. If you want to know more about the kernel cache database, please refer to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/cache.html. Thanks.
I agree, you can't just close an issue like that. There is a significant performance problem here, and that is not fine. I would guess that AMD wants its platform to perform well on MI250X. If @formiel can't use MI250X for now, you could at least ask for a reproducer and work on it on your side.
Just to be clear, @huanrwan-amd, this is a discussion about the behavior of the cache db, and the documentation you linked is sparse.
As @formiel said:
running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.
Is that sound, or is it wishful thinking?
Hi @etiennemlb,
Thanks for your comments. I’ve reopened the ticket as requested.
As mentioned in the documentation, the kernel cache database comes in two flavors: a system database (*.kdb) and a user database (*.ukdb), located under $HOME/.cache/miopen/ or at another location set by the user. When a kernel is needed, MIOpen first checks whether it exists in the database. If it does, the already-built kernel is reused; if not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.
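To see what has actually been written, the cache directory can simply be listed after a run (default location shown; substitute MIOPEN_USER_DB_PATH / MIOPEN_CUSTOM_CACHE_DIR if you override them):
# List the MIOpen user databases written so far (default location;
# use ${MIOPEN_USER_DB_PATH} instead if you set a custom path).
ls -lhR "${HOME}/.cache/miopen/"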
Thanks, @huanrwan-amd.
The ROCm version is 6.0.0. But @formiel tested using pytorch+rocm6.1 and pytorch+rocm6.2. AFAIK, the problem was always present. The OS is RHEL 8.9.
When a kernel is needed, MIOpen first checks whether it exists in the database. If it does, the already-built kernel is reused; if not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.
@formiel, from that quote, I'd say our guess could be right. @huanrwan-amd, is there a way to hard-fail/stop the program if a kernel is not found in the database? That way, if you could ensure that all the kernels are "precompiled" for a given workload, you could guarantee that a production run never compiles any kernels.
@huanrwan-amd, is the "enable logs" you mention based on the following?
export MIOPEN_ENABLE_LOGGING=1
export MIOPEN_ENABLE_LOGGING_CMD=1
export MIOPEN_LOG_LEVEL=6
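As a rough check, the captured output could then be grepped for kernel-build activity to see whether a later job still compiles kernels (sketch only; the exact MIOpen log strings vary between versions, and the log file name is a placeholder):
# Count log lines that look like runtime kernel builds. The log file name
# is a placeholder and the matched strings may differ between MIOpen versions.
grep -ci "building kernel\|compile" "job-${SLURM_JOB_ID}.out"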
Hi @etiennemlb,
is there a way to hard fail/stop the program if a kernel is not found in the database?
I would suggest updating to ROCm 6.2.2 and recording the logs first. Yes, and you can get more details by enabling MIOPEN_ENABLE_SQLITE for the database:
export MIOPEN_ENABLE_SQLITE=1
export MIOPEN_LOG_LEVEL=7
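If the user databases are indeed SQLite files, you can also peek at them directly with the sqlite3 CLI (sketch; the exact .ukdb file name depends on the GPU architecture and MIOpen version, so it is a placeholder here):
# Inspect a user kernel database; the file name below is a placeholder.
sqlite3 "${MIOPEN_USER_DB_PATH}/gfx90a.ukdb" ".tables"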
Hi @etiennemlb and @formiel ,
For the cache db path set in the run.slurm posted by @formiel:
MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
please note that with this setting each SLURM_JOB_ID will generate its own *.ukdb file in a separate directory, so a new job does not see the databases built by previous jobs.
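If the cache has to stay under /tmp during the run, one (untested) workaround would be to stage it in and out around training (${SCRATCH}/miopen-cache is a placeholder for persistent storage, and on multi-node jobs the staging would have to be executed on every node, e.g. via srun):
# Untested sketch: restore a persistent cache into the per-job /tmp path at
# job start, and copy it back at the end so the next job can reuse it.
mkdir -p "${MIOPEN_USER_DB_PATH}" "${SCRATCH}/miopen-cache"
cp -r "${SCRATCH}/miopen-cache/." "${MIOPEN_USER_DB_PATH}/"
# ... run training here ...
cp -r "${MIOPEN_USER_DB_PATH}/." "${SCRATCH}/miopen-cache/"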
Hello,
I would like to ask whether we can use MIOPEN_USER_DB_PATH to accelerate model training in a sequential job setting, where each job starts after the previous one has finished. I have checked what the documentation says on this point. In my experiments, I observed a gradual speedup over the course of the first training run.
However, I need to set up the jobs sequentially due to the time limits imposed by SLURM. During the second run, the model went through the same phases as the first run, with steps 35k-40k taking 545 minutes, and so on.
After reading a previous comment and the documentation, I wonder if setting MIOPEN_USER_DB_PATH specific to each experiment (based on the experiment name) and SLURM process ID, as described below, could help reuse the optimized convolutional kernels found in previous runs and make training faster. If not, is there any way to sustain the performance reached at the end of the previous run, so that the first 5k steps of the next job take around 110 minutes? The same training on A100 takes 60 minutes per 5k steps, whereas the average on MI250x, as shown above, is around 250 minutes per 5k steps, i.e. more than 4 times slower than on A100.
Many thanks in advance for your response!
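A hypothetical sketch of the kind of setting described above (EXP_NAME is a placeholder for the experiment name, and keying on SLURM_PROCID would give each rank its own database files on persistent storage):
# Hypothetical sketch only: key the user db on the experiment name and the
# SLURM process ID so each rank writes to its own database files.
export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen-cache/${EXP_NAME}-${SLURM_PROCID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p "${MIOPEN_USER_DB_PATH}"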