AMDResearch / omnistat

https://amdresearch.github.io/omnistat/
MIT License

Feature/rmsv2 #111

Open omri-amd opened 1 month ago

omri-amd commented 1 month ago

Draft PR, to be reviewed and worked on before any merges.

collector_rms proposed changes:

Add a new method to fetch Slurm job data using `scontrol`:
a. ~using scontrol, we can fetch all data from a central place, no need to collect logs from each compute node~ Changed the behavior to be exactly the same as now, with additional data that can come from `scontrol show job <id> -d --json`.
b. Added support for multiple jobs running on the same compute node.
c. Added support for the GPU ids being used by each job.
d. Added support for the CPU ids being used by each job.
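As a rough illustration of the kind of parsing this involves (not code from this PR), the sketch below shells out to `scontrol show job <id> -d --json` and pulls out per-job GPU indices. The JSON field names (`jobs`, `job_id`, `user_name`, `partition`, `gres_detail`) and the `gpu:N(IDX:...)` string format are assumptions about typical Slurm JSON output, not verified against this change.

```python
# Hypothetical sketch: fetch per-job GPU indices via scontrol's JSON output.
# Field names ("jobs", "job_id", "user_name", "gres_detail") and the
# "gpu:N(IDX:0-1,4-5)" string format are assumptions, not taken from the PR.
import json
import re
import subprocess


def job_gpu_info(jobid: str) -> dict:
    """Return basic job labels plus the list of GPU ids allocated to the job."""
    out = subprocess.run(
        ["scontrol", "show", "job", jobid, "-d", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    job = json.loads(out)["jobs"][0]

    indices = []
    # gres_detail entries typically look like "gpu:4(IDX:0-1,4-5)", one per node.
    for entry in job.get("gres_detail", []):
        match = re.search(r"IDX:([\d,\-]+)", entry)
        if not match:
            continue
        for part in match.group(1).split(","):
            lo, _, hi = part.partition("-")
            indices.extend(range(int(lo), int(hi or lo) + 1))

    return {
        "jobid": str(job["job_id"]),
        "user": job.get("user_name", ""),
        "partition": job.get("partition", ""),
        "cards": sorted(set(indices)),
    }


if __name__ == "__main__":
    print(job_gpu_info("300"))
```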

TODO:
a. ~feature (label) parity with existing collector_rms~ Done
b. ~test jobs that do not use GPUs~ Done
c. ~grafana dashboard samples with new gpu/cpu ids~ Done; will be committed at a later date, in a separate PR.

Performance-wise, we see roughly ~150~ 100 ms being added to the Prometheus query on a cluster with 15 active jobs (about 30 ms for amdsmi alone, ~180~ 130 ms for amdsmi + rms v2).

We will conduct more performance testing once we get a quorum to continue this work.

Sample output:

rmsjob_info{account="AMD",batchflag="0",card="0",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0
rmsjob_info{account="AMD",batchflag="0",card="1",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0

jordap commented 1 month ago

I haven't looked at the code yet. But before considering support for multiple jobs on the same node, I think it would be important to understand the changes you are proposing to the RMS collector and their impact on query complexity and performance. I'd suggest providing: 1) an example of how one of the current queries in the job dashboard needs to be updated, and 2) an estimate of the performance impact for dashboard queries.

omri-amd commented 1 month ago

Hi @jordap, @koomie, we have made additional changes to keep this in line with the existing implementation. Once we review it, we can consider merging it behind an additional config flag (for example, rms_extra_info=True), as it is 100% compatible with the existing collector_rms code.

Samples from Grafana (GPU utilization; the Slurm job uses 2 nodes with 2 GPUs each, out of 8 GPUs per node). [screenshot omitted] The top chart is our modified chart, showing 100% usage of the 4 requested GPUs; the bottom chart is what currently exists, showing 25% usage across the 2 nodes (4 GPUs at 100% out of 16).

Queries (the amdsmi prefix is to be replaced by the rocm prefix):

To add support for card, add `on (instance, card)`:

avg(amdsmi_average_gfx_activity * on (instance, card) group_left() rmsjob_info{jobid="$jobid"})

To add multi-job support to an existing panel (only required if you don't run in exclusive mode), add `group by (instance)`:

avg(amdsmi_average_gfx_activity * on (instance) group_left() group by (instance) (rmsjob_info{jobid="$jobid"}))

jordap commented 1 month ago

I looked into the changes and performed some additional testing. The approach in this PR can work, particularly at smaller scales. But it introduces a significant number of changes and has an impact on scalability. This is a useful feature, and I think it would be best for us to spend more time to figure out if this is the best approach to support shared nodes in the long run.

More detailed comments and description of some of the issues we need to consider:

  1. Dashboards. We'll need dedicated sets of dashboards for clusters with this option enabled. I was hoping we could keep them compatible and be able to enable this option on a subset of nodes. Unfortunately, all queries need to be updated and they are incompatible with the existing queries (different join). We'd need to automate the conversion of existing dashboards with a script (doable; see the sketch after this list), but all nodes in the cluster would need to use the same collector. It will also break some panels like the job allocation timeline in the Node dashboard, which won't be as easy to fix/automate.
  2. Annotations. The current implementation breaks annotations and we should find a way to make it work.
  3. Implementation. It's currently an entirely separate collector, but ideally we'd like it to be an option of the rms collector. There is also room to improve the performance of the current implementation (~10ms for rms vs ~110ms for rmsv2).
  4. Node exporter. It's not clear what to do with node_exporter metrics. With this approach, the obvious option to support node_exporter is to add more labels for different resources (cores, memory). But there is no way to keep track of all resources (think IO, network). It would make sense to simply remove all node_exporter panels when this option is enabled, but that's a significant downgrade. Alternatively, we could display some of these resources as "shared" with other users/jobs, but that may not always be desirable.
  5. Scale. One of the main concerns is the amount of labels this change introduces, particularly if we want to support CPU and additional resources. card itself isn't a very high-cardinality label; in current systems it will only have ~8 possible values. But we need to keep in mind that other labels in rmsjob_info already have high cardinality and churn potential. If we keep adding more labels we'll run into issues sooner rather than later.
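On the dashboard conversion script mentioned in point 1, a minimal sketch of what such automation could look like is shown below. It assumes the usual Grafana dashboard JSON layout (top-level `panels`, each with `targets` carrying an `expr`), ignores nested row panels, and only performs the simple join rewrite discussed earlier in the thread; it is an illustration, not code from this PR.

```python
# Hypothetical sketch: rewrite the Prometheus join in every query of a Grafana
# dashboard so it matches on (instance, card) instead of on (instance).
# Dashboard field names ("panels", "targets", "expr") follow the common Grafana
# JSON schema; nested rows and other edge cases are deliberately ignored.
import json
import re
import sys


def convert_expr(expr: str) -> str:
    # Only the join clause changes; the rmsjob_info selectors stay as-is.
    return re.sub(r"on\s*\(\s*instance\s*\)", "on (instance, card)", expr)


def convert_dashboard(path_in: str, path_out: str) -> None:
    with open(path_in) as f:
        dashboard = json.load(f)
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if "expr" in target:
                target["expr"] = convert_expr(target["expr"])
    with open(path_out, "w") as f:
        json.dump(dashboard, f, indent=2)


if __name__ == "__main__":
    convert_dashboard(sys.argv[1], sys.argv[2])
```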

I think we need to keep looking into the performance of each single sample, try to find a way to make the data/queries compatible between exclusive and shared, and figure out what the plans are for combining data with node_exporter. It may also be helpful for us to estimate the number of active metrics for different cluster configurations and set scalability targets.

jordap commented 1 month ago
> 5. Scale. One of the main concerns is the amount of labels this change introduces, particularly if we want to support CPU and additional resources. card itself isn't a very high-cardinality label; in current systems it will only have ~8 possible values. But we need to keep in mind that other labels in rmsjob_info already have high cardinality and churn potential. If we keep adding more labels we'll run into issues sooner rather than later.

For reference, in a hypothetical cluster with 10k nodes, 100k GPUs, and a moderate churn of 10-minute jobs, the number of active metrics grows significantly when the card label is added to rmsjob_info: from ~1.3M metrics to ~2.4M metrics. And it can get worse if jobs are shorter.

We could decide that that's OK for this feature, particularly if we can make this work in a subset of nodes. But that's only for GPUs. If we were to add more labels for other resources like cores, which can have values in the [0-512) range, the number of metrics will explode really fast and I don't think this approach will work.
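To make the scalability discussion more concrete, below is a rough sketch of the kind of back-of-the-envelope estimator suggested earlier ("estimate the number of active metrics for different cluster configurations"). The model (one rmsjob_info series per job per node, multiplied by the allocated GPUs when the card label is enabled, with churned series counted as active for a configurable window) is an assumption for illustration only; it is not the calculation behind the ~1.3M/~2.4M figures above.

```python
# Hypothetical back-of-the-envelope estimator for active rmsjob_info series.
# The model (series stay "active" for a fixed window after a job ends, one
# series per job per node, times allocated GPUs when the card label is on)
# is an assumption for illustration, not the calculation used in this thread.

def active_rmsjob_series(
    nodes: int,
    gpus_per_node: int,
    job_minutes: float,
    active_window_minutes: float = 120.0,
    nodes_per_job: int = 1,
    card_label: bool = False,
) -> int:
    """Estimate concurrently-active rmsjob_info series for a busy cluster."""
    # Short jobs churn: several "generations" of jobs stay active in the window.
    generations = max(1.0, active_window_minutes / job_minutes)
    concurrent_jobs = nodes / nodes_per_job
    series_per_job = nodes_per_job * (gpus_per_node if card_label else 1)
    return int(concurrent_jobs * generations * series_per_job)


if __name__ == "__main__":
    for card in (False, True):
        n = active_rmsjob_series(
            nodes=10_000, gpus_per_node=10, job_minutes=10, card_label=card,
        )
        print(f"card label {'on ' if card else 'off'}: ~{n:,} rmsjob_info series")
```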

jordap commented 1 month ago
> 1. Dashboards. We'll need dedicated sets of dashboards for clusters with this option enabled. I was hoping we could keep it compatible and be able to enable this option in a subset of nodes. Unfortunately, all queries need to be updated and they are incompatible with existing queries (different join). We'd need to automate the conversion of existing dashboards with a script (doable), but all nodes in the cluster will need to use the same collector. It will also break some panels like the job allocation timeline in the Node dashboard, which won't be as easy to fix/automate.

One option to make the same dashboards work in both scenarios (with and without the card label in rmsjob_info) is to use an or:

(rocm_utilization_percentage * on (instance,card) group_left() rmsjob_info{jobid="$jobid",card!=""}) or (rocm_utilization_percentage * on (instance) group_left() rmsjob_info{jobid="$jobid",card=""})

This will attempt to perform both halves, but only one of them will match for a given node: in PromQL the matcher card="" also matches series that do not have a card label at all, so the second half covers the existing collector while card!="" covers the new one. The downside is the obvious duplication and increased complexity in every query.
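If the duplication becomes a maintenance burden, one option might be to generate the dual form from a single base query when building or converting dashboards. A hypothetical helper (not part of this PR; the function name and approach are illustrative only):

```python
# Hypothetical helper: build the dual "or" query shown above from a base metric
# and job selector, so dashboards don't have to hand-write both halves.

def dual_mode_query(metric: str, job_selector: str) -> str:
    """Build a query that works with or without the card label on rmsjob_info."""
    with_card = (
        f"({metric} * on (instance,card) group_left() "
        f'rmsjob_info{{{job_selector},card!=""}})'
    )
    without_card = (
        f"({metric} * on (instance) group_left() "
        f'rmsjob_info{{{job_selector},card=""}})'
    )
    return f"{with_card} or {without_card}"


if __name__ == "__main__":
    # Reproduces the example query from this comment.
    print(dual_mode_query("rocm_utilization_percentage", 'jobid="$jobid"'))
```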