In a previous change we began using the short name "MI300" for gpu_model instead of the full name ("MI300X_A0", "MI300X_A1", etc.).
The XCD detection code was still receiving gpu_model and expecting the full name, so it reported an XCD count of 1 and several metrics (e.g. VALU utilization, wavefront occupancy) were off by a factor of 8.
Passing chip_id instead of gpu_model fixes the issue.
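
For reviewers, here is a minimal sketch of the failure mode, not the actual code. All function names, table names, and chip_id values below are hypothetical placeholders, and the count of 8 XCDs and the fallback of 1 are inferred from the factor-of-8 error described above.

```python
# Hypothetical tables for illustration only; the chip_id strings are placeholders.
XCDS_BY_FULL_NAME = {"MI300X_A0": 8, "MI300X_A1": 8}
CHIP_ID_TO_FULL_NAME = {"chip-id-a0": "MI300X_A0", "chip-id-a1": "MI300X_A1"}


def xcds_from_gpu_model(gpu_model: str) -> int:
    """Old behaviour: look up by gpu_model, assuming it is the full name."""
    # The short name "MI300" is not a key, so the assumed default of 1 is returned.
    return XCDS_BY_FULL_NAME.get(gpu_model, 1)


def xcds_from_chip_id(chip_id: str) -> int:
    """Fixed behaviour: resolve the exact variant from chip_id first."""
    full_name = CHIP_ID_TO_FULL_NAME.get(chip_id)
    return XCDS_BY_FULL_NAME.get(full_name, 1)


old_count = xcds_from_gpu_model("MI300")     # shortened name misses the table
new_count = xcds_from_chip_id("chip-id-a0")  # chip_id still identifies the variant
print(old_count, new_count)
```

Any metric that scales with the XCD count would differ by new_count / old_count between the two paths.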