When we profile on a node with 2 CPUs (with 64 cores each) and 8 GPUs and enable ROCPROFILER, ROCTRACER and ROCM_SMI support, we typically see many lines of CPU performance metrics, one for each core. We then have to scroll all the way to the bottom to find the GPU activity and select that manually to see the HIP activity and the GPU activity close to each other. Is it possible to move the GPU activity closer to the CPU/HIP activity and near the top of the trace?
There are a couple of other questions:
If we have a Slurm allocation for 1 GPU, it can be any 1 of, say, 8 GPUs on the node. Will Omnitrace still collect metrics on all 8 GPUs? If we set OMNITRACE_SAMPLING_GPUS = 0, would it collect on only 1 GPU, and is that the same GPU that Slurm allocated?
Is OMNITRACE_SAMPLING_CPUS = 0 the right way to limit CPU metric collection to a single CPU core and so shorten the resulting trace?
AFAICT, the Perfetto GUI organizes those tracks alphabetically, but I can look into whether there are ways around that.
If I recall correctly, Slurm only makes 1 GPU visible, so you shouldn't have to set OMNITRACE_SAMPLING_GPUS. If that's not the case, you could also try setting it to %env{HIP_VISIBLE_DEVICES}%, assuming Slurm sets that variable.
Yes, OMNITRACE_SAMPLING_CPUS = 0 restricts the CPU metrics to that one core. It also accepts "none" if you simply don't want any CPU frequency info.
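To make the two settings above concrete, here is a minimal Python sketch that builds the environment for a profiling run. The helper name `omnitrace_sampling_env` is hypothetical and only for illustration. It assumes that Slurm exports HIP_VISIBLE_DEVICES for the allocation and that Omnitrace reads OMNITRACE_SAMPLING_GPUS from the environment using the same device numbering. Neither has been verified here, so treat it as a starting point rather than a confirmed recipe.

```python
import os


def omnitrace_sampling_env(base_env, cpus="0"):
    """Return a copy of base_env with the sampling settings discussed above.

    Hypothetical helper, not part of Omnitrace.
    cpus: value for OMNITRACE_SAMPLING_CPUS ("0" for core 0 only, or "none").
    """
    env = dict(base_env)
    env["OMNITRACE_SAMPLING_CPUS"] = cpus

    # Assumption: Slurm sets HIP_VISIBLE_DEVICES for the allocated GPU(s).
    # If it is not set, leave OMNITRACE_SAMPLING_GPUS alone and rely on
    # Slurm exposing only the allocated device.
    visible = env.get("HIP_VISIBLE_DEVICES")
    if visible:
        env["OMNITRACE_SAMPLING_GPUS"] = visible
    return env


if __name__ == "__main__":
    env = omnitrace_sampling_env(os.environ)
    for key in ("OMNITRACE_SAMPLING_CPUS", "OMNITRACE_SAMPLING_GPUS"):
        print(key, "=", env.get(key, "<unset>"))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the profiled application. Passing `cpus="none"` instead drops the CPU frequency tracks entirely.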