Use of MIOPEN_USER_DB_PATH for training speedup in sequential jobs settings

formiel commented 1 month ago

Hello,

I would like to ask if we can use MIOPEN_USER_DB_PATH to accelerate model training in a sequential job setting, where each job starts after the previous one has finished . As I checked the documentation, it is said that:

System FindDb can be cached into memory and may dramatically increase performance.

In my experiments, I observed a gradual speedup during the first run of model training as follows:

Step 0 - 5k: took 545 minutes
Step 5k - 10k: took 373 minutes
Step 10k - 15k: took 300 minutes
Step 15k - 20k: took 162 minutes
Step 20k - 25k: took 137 minutes
Step 25k - 30k: took 127 minutes
Step 30k - 35k: took 110 minutes.

However, I need to setup the jobs sequentially due to time constraints imposed by SLURM. During the second run, the model experienced similar phases as the first run, with step 35k - 40k taking 545 minutes and so on.

After reading a previous comment and the documentation, I wonder if setting the MIOPEN_USER_DB_PATH specific to each job (based on the experiment name) and SLURM process ID as below could help leverage the optimized convolutional kernels found in previous runs to make training faster:

export MIOPEN_USER_DB_PATH="$SCRATCH/tmp/miopen-cache/${CONFIG}_ngpus${NGPUs}/$SLURM_PROCID"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p ${MIOPEN_USER_DB_PATH}

If not, is there any way to sustain the performance observed in the previous run, such that the first 5k step of the next job takes 110 minutes please? As the same training on A100 takes 60 minutes for each 5K steps, the average run on MI250x as shown above would take around 250 minutes, which is more than 4 times longer than on A100.

Many thanks in advance for your response!

formiel commented 1 month ago

FYI, my experiments are run on 32 GPUs MI250x and my run.slurm is as below:

#!/bin/bash

module purge && module load cpe/23.12 craype-accel-amd-gfx90a craype-x86-trento PrgEnv-gnu amd-mixed/6.0.0 cray-python/3.10.10 aws-ofi-rccl && module list

source ~/env/py310_rocm60_torch241_main/bin/activate

export PYTHONPATH=$PYTHONPATH:$HOME/code/fairspeech_py310_rocm60_torch241_main
export FAIRSEQ=$HOME/code/fairspeech_py310_rocm60_torch241_main

export ROCBLAS_INTERNAL_FP16_ALT_IMPL=1
export MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL=1

export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p ${MIOPEN_USER_DB_PATH}

export MIOPEN_DEBUG_AMD_ROCM_PRECOMPILED_BINARIES=0 # disable use of precompiled binaries for certain operations, including the Winograd convolution algorithm
export MIOPEN_DEBUG_CONV_FFT=0 # disable FFT convolution algorithm
export MIOPEN_DEBUG_CONV_DIRECT=0 # disable Direct convolution algorithm
# export MIOPEN_DEBUG_GCN_ASM_KERNELS=0 # disable hand-tuned asm. kernels for Direct convolution algorithm. Fall-back to kernels written in high-level language.

# echo des commandes lancees
set -x

NCPU_PER_PROC=$((NCPUS_PER_TASK / NGPUS_PER_NODE))
export OMP_NUM_THREADS=${NCPU_PER_PROC}
# export PYTHONPATH=$HOME/code/fairspeech # comment out to use load_torchxx.sh in Adastra

echo "Total ${NGPUs} GPUs requested, ${NGPUS_PER_NODE} GPUs per node"

MASTER_PORT=$(shuf -i 20000-65000 -n 1)

# export LOGLEVEL="INFO"
# eval "TMPDIR=$SCRATCH/tmp OMP_NUM_THREADS=${NCPUS_PER_TASK} srun ${COMMAND}"
eval 'srun \
--cpu-bind=none \
--mem-bind=none \
--label \
-- torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${NGPUS_PER_NODE}" \
--rdzv-id="${SLURM_JOBID}" \
--rdzv-backend=c10d \
--rdzv-endpoint="$(scontrol show hostname ${SLURM_NODELIST} | head -n 1):${MASTER_PORT}" \
--max-restarts="1" \
${COMMAND}'

ppanchad-amd commented 1 month ago

Hi @formiel. Internal ticket has been created to assist with your issue. Thanks!

huanrwan-amd commented 1 month ago

Hi @formiel , are you able to run your run.slurm? and can you post the results? According to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/perfdb.html, you can try system PerDb (which is "A system-wide storage that holds pre-run values for the most applicable configurations."), note that "User PerfDb always takes precedence over System PerfDb."

formiel commented 3 weeks ago

Hello @huanrwan-amd,

Thank you very much for your reply!! I encountered an error when setting MIOPEN_USER_DB_PATH to a local disk space in order to reuse the optimized kernels for subsequent runs. My colleague @etiennemlb suggested a solution: running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.

Due to the time and resource constraints, I’m unable to try this solution at the moment, but I’ll test it when possible and share the results with you later.

huanrwan-amd commented 3 weeks ago

Hi @formiel, thank you for your response. I will close the ticket for now.

netw0rkf10w commented 3 weeks ago

@huanrwan-amd Why close the issue? Isn't it a big issue if kernel caches cannot be used across sequential jobs?

huanrwan-amd commented 3 weeks ago

Hi @netw0rkf10w, this ticket is to address a specific issue for the originator. If you want to know more about kernel caches database, please refer https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/cache.html . Thanks.

etiennemlb commented 3 weeks ago

I agree you can't just close an issue like that, there is a significant performance issue and that is not fine. I would guess that AMD wants its platform to perform well on MI250X. If @formiel can't use MI250X for now, you could at least ask for a reproducer and work on your side.

Just to be clear, @huanrwan-amd, this is a discussion on the behavior of the cache db, and the doc you gave is scarce.

As @formiel said:

running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.

is that sound or wishful thinking ?

huanrwan-amd commented 3 weeks ago

Hi @etiennemlb,

Thanks for your comments. I’ve reopened the ticket as requested.

Could you please provide more information on the software, specifically the OS version and ROCm version? The latest release for ROCm is 6.2.2. ROCm Releases
Please enable logs as described here: Enable Logs

As mentioned in the documentation, the cache database has two types: system PerfDb (.kdb) and user PerfDb (.ukdb), located under /$HOME/.cache/miopen/ or another location set by the user. When a kernel is needed, MIOpen first checks if it exists in the database. If it does, the built kernel is reused. If not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.

etiennemlb commented 3 weeks ago

Thanks, @huanrwan-amd.

The ROCm version is 6.0.0. But @formiel tested using pytorch+rocm6.1 and pytorch+rocm6.2. AFAIK, the problem was always present. The OS is RHEL 8.9.

When a kernel is needed, MIOpen first checks if it exists in the database. If it does, the built kernel is reused. If not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.

@formiel , from that quote, Id say our guess could be right. @huanrwan-amd is there a way to hard fail/stop the program if a kernel is not found in the database ? This way if you could ensure that all the kernels are "precompiled" for a given workload then for production run you could ensure you will never compile kernels.

@huanrwan-amd is the "enable logs" you mention based on:?

export MIOPEN_ENABLE_LOGGING=1
export MIOPEN_ENABLE_LOGGING_CMD=1
export MIOPEN_LOG_LEVEL=6

huanrwan-amd commented 3 weeks ago

Hi @etiennemlb,

is there a way to hard fail/stop the program if a kernel is not found in the database?

I would suggest update to ROCm 6.2.2 and recording the logs first. Yes, you could have more details by adding MIOPEN_ENABLE_SQLITE for db:

export MIOPEN_ENABLE_SQLITE=1&&
export MIOPEN_LOG_LEVEL=7

huanrwan-amd commented 3 weeks ago

Hi @etiennemlb and @formiel , For the cache db path as in the run.slurm mentioned by @formiel : MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}" Please note that each SLURM_JOB_ID will generate a *.ukdb file.

ROCm / MIOpen

Use of MIOPEN_USER_DB_PATH for training speedup in sequential jobs settings #3322