ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0

[redgreengpu] CRAY_ACC_ERROR: host region overlaps present region but is not contained for 'pgp3a(:,:,:,:)' #134

Open okkevaneck opened 2 months ago

okkevaneck commented 2 months ago

I've compiled and installed the redgreengpu branch on LUMI-G and ran the ectrans-benchmark-gpu-dp binary. Unfortunately, this resulted in the following error message:

ACC: libcrayacc/acc_present.c:679 CRAY_ACC_ERROR - Host region (b6bc740 to b6fb140) overlaps present region (b6bc140 to b6fae40 index 64) but is not contained for 'pgp3a(:,:,:,:)' from ../../../pfs/lustrep4/scratch/project_465000527/ovaneck/ectrans_dwarf/src/sources/ectrans/src/trans/gpu/internal/trltog_mod.F90:460

I'm clueless as to what the problem may be, so I've also included my installation setup as a tar.gz for anyone to try:
ectrans_dwarf.tar.gz

Simply acquire an interactive LUMI-G compute node and execute ./install_redgreengpu.sh, which will clone, build, and install all required sources. Afterwards, go to a login node, cd into the run directory, and sbatch the run_sbatch_lumi-g.sh script to get the error output in the err.<slurm_job_id>.0 file within the results/sbatch/ folder.
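
In case it helps, a rough sketch of that sequence (account, partition, and time are placeholders; the script and directory names are the ones inside the tar.gz):

# On LUMI-G: acquire an interactive compute node (adjust account/partition/time to your project).
salloc --nodes=1 --partition=dev-g --account=<your_project> --time=01:00:00

# On the compute node: clone, build, and install all required sources.
./install_redgreengpu.sh

# Later, from a login node: submit the benchmark job.
cd run
sbatch run_sbatch_lumi-g.sh
# The error output appears in results/sbatch/err.<slurm_job_id>.0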

samhatfield commented 2 months ago

Hi @okkevaneck - we have seen these errors before. I have just tested redgreengpu on LUMI-G and I am able to run under my own build/run framework, so the question is: what is different about yours? I'll look into it.

By the way, ecKit and FCKit are not dependencies of ecTrans, so you don't need to build those.

More generally, so everyone is on the same page, let me summarise the current support of AMD GPUs with ecTrans:

okkevaneck commented 2 months ago

Hi @samhatfield, thank you for the quick reply! Interesting that it's different; let me know if I can provide you with any extra info.

Good to know eckit and fckit are not dependencies; this will reduce our installation time quite a bit.

Also, many thanks for the overview of the current state. We heard from @reuterbal that we should use the redgreengpu branch as the main branch is currently not stable on AMD architectures, but it's also good to know about the ongoing developments.

samhatfield commented 2 months ago

I wasn't able to follow your build instructions entirely successfully. I get the interactive node with

salloc --nodes=1 --tasks=1 --cpus-per-task=32 --account=project_465000454 --gpus-per-task=1 --partition=dev-g --time=00:30:00

(is this wrong?)

Then I execute

srun -n 1 ./install_redgreengpu.sh lumi

The build finishes, but when I look at src/build/ectrans.log, I see

-- HIP target architecture: gfx803

It should be gfx90a. Sure enough, when I test the resulting binary, it doesn't work:

> srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
srun: error: nid005006: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=7971168.2

Is there something I'm missing?
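
For reference, a quick way to check which GPU architecture the machine you build on actually sees (assuming the ROCm tools are on the PATH, as they normally are in the LUMI GPU environment):

# List the GPU architectures visible to ROCm on the current node.
# A LUMI-G compute node with visible GPUs should report gfx90a (MI250X);
# if no GPU is visible, architecture detection can fall back to a default
# such as gfx803, which matches what I saw in ectrans.log.
rocm_agent_enumerator
# or, more verbosely:
rocminfo | grep -i gfx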

okkevaneck commented 2 months ago

I allocate the node slightly differently and SSH onto the compute node; maybe that's what's causing the difference.

To allocate a node, I run:

#!/usr/bin/env bash

JOB_NAME="ia_gpu_dev"
GPUS_PER_NODE=8
NODES=1
NTASKS=8
PARTITION="dev-g"
ACCOUNT="project_465000454"
TIME="01:00:00"

# Allocate interactive node with the set variables above.
salloc \
    --gpus-per-node=$GPUS_PER_NODE \
    --exclusive \
    --nodes=$NODES \
    --ntasks=$NTASKS \
    --partition=$PARTITION \
    --account=$ACCOUNT \
    --time=$TIME \
    --mem=0 \
    --job-name=$JOB_NAME

Then to get onto the compute node, I execute the following from a login node:

ROCR_VISIBLE_DEVICES=0 srun --cpu-bind=mask_cpu:0xfe000000000000 --nodes=1 --pty bash -i

And then I execute the script without any SLURM command, as we're already on the compute node:

./install_redgreengpu.sh lumi

I forgot about the ROCR_VISIBLE_DEVICES=0 and --cpu-bind=mask_cpu:0xfe000000000000; I think this could be what's causing the behavior you're seeing. Let me know if it helped!

samhatfield commented 2 months ago

Will give it a go, thanks! I'm waiting quite a long time today to get allocated a node.

samhatfield commented 2 months ago

Now I see

-- HIP target architecture: gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a

which is good. I still found it difficult to get an interactive session on a compute node:

> ROCR_VISIBLE_DEVICES=0 srun --cpu-bind=mask_cpu:0xfe000000000000 --nodes=1 --pty bash -i
srun: Warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 8 with the number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 7971505: More processors requested than permitted

Instead I ran

ROCR_VISIBLE_DEVICES=0 srun --ntasks=1 --pty bash -i

Now I've successfully built the binary. And I think I've found the cause of the problem. Could you try running without --nproma $NPROMA?

In my setup, I get the exact same error as you when I include --nproma 32. To be honest, this option is sort of irrelevant for ecTrans benchmarking: it determines the data layout in grid point space, but no calculations are done in grid point space, so we usually don't specify it at all when benchmarking ecTrans. We do like to keep the option, though, so we can replicate situations from the IFS (where NPROMA very much has consequences) in ecTrans. Therefore this option should work, and this is clearly a bug!

For now, if you just want to benchmark ecTrans, you can leave this option off. In the meantime I'll try to find the cause of this bug.
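
For example, a minimal invocation without --nproma (the binary path is the one from the build tree above; the srun resources are whatever your job setup needs):

# Run the GPU benchmark and let ecTrans pick its own grid-point layout,
# i.e. simply omit --nproma; only flags already mentioned in this thread are used.
srun -n 1 ./src/build/ectrans/bin/ectrans-benchmark-gpu-dp --truncation 79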

okkevaneck commented 2 months ago

Hmm, interesting. I wonder why the interactive node works for me...

I tried running without --nproma 32 and it works, thank you very much! It does make me wonder, though: how do you alter the workload size with this version? I looked at an older version at the beginning of this year, which had options to scale the workload through the NLAT and NLON variables.

samhatfield commented 2 months ago

Great to hear it works. I'm figuring out how we might fix this so we can run with any NPROMA. Let's keep this issue open until we decide how to proceed.

With the benchmark program, the problem size in both spectral and grid point space can be set by a single parameter, -t, --truncation. This is the cutoff zonal and total wavenumber in spectral space. The higher this number, the higher the resolution, and the bigger the work arrays.

By default the benchmark driver uses an octahedral grid for grid point space with a cubic-accuracy representation of waves, which basically means the number of latitudes must be 2 * (truncation + 1). -t, --truncation 79 (the default if you don't specify the option) therefore gives an octahedral grid with 160 latitudes. The number of longitude points per latitude depends on the latitude: it is greatest at the equator and tapers to 20 at the poles.
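
As a rough worked example, here is the arithmetic for the default resolution, assuming the usual octahedral O-grid rule of 20 longitude points at the polar-most latitude, increasing by 4 per latitude towards the equator (a back-of-envelope sketch, not output from the benchmark):

# Back-of-envelope grid size for the default -t 79 run.
T=79
N=$((T + 1))                     # octahedral "N" for a cubic representation: 80
NLAT=$((2 * N))                  # latitudes: 2 * (truncation + 1) = 160
NGPTOT=$((4 * N * N + 36 * N))   # total grid points: sum of 20 + 4*(i-1) per latitude, both hemispheres
echo "NLAT=$NLAT NGPTOT=$NGPTOT" # prints NLAT=160 NGPTOT=28480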

okkevaneck commented 2 months ago

Ah that's how it works! Many thanks Sam!