ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License
111 stars 48 forks source link

[Issue]: Recent changes to 6.0.1-112 make it ABI incompatible to 6.0.0 #152

Open bertwesarg opened 5 months ago

bertwesarg commented 5 months ago

Problem Description

6.0.1-112 includes this change to the rocm_smi.h header:

@@ -1091,6 +1099,19 @@
   uint16_t current_vclk0s[RSMI_MAX_NUM_CLKS];
   uint16_t current_dclk0s[RSMI_MAX_NUM_CLKS];

+  /*
+   * v1.5 additions
+   */
+  // JPEG activity percent (encode/decode)
+  uint16_t jpeg_activity[RSMI_MAX_NUM_JPEG_ENGS];
+
+  // PCIE NAK sent accumulated count
+  uint32_t pcie_nak_sent_count_acc;
+
+  // PCIE NAK received accumulated count
+  uint32_t pcie_nak_rcvd_count_acc;
+
+
   /// \endcond
 } rsmi_gpu_metrics_t;

but this change of the size of type rsmi_gpu_metrics_t breaks the ABI to 6.0.0. And currently both librocm_smi.so libs have the same interface version 6.

oliveiradan commented 5 months ago

Problem Description

6.0.1-112 includes this change to the rocm_smi.h header:

@@ -1091,6 +1099,19 @@
   uint16_t current_vclk0s[RSMI_MAX_NUM_CLKS];
   uint16_t current_dclk0s[RSMI_MAX_NUM_CLKS];

+  /*
+   * v1.5 additions
+   */
+  // JPEG activity percent (encode/decode)
+  uint16_t jpeg_activity[RSMI_MAX_NUM_JPEG_ENGS];
+
+  // PCIE NAK sent accumulated count
+  uint32_t pcie_nak_sent_count_acc;
+
+  // PCIE NAK received accumulated count
+  uint32_t pcie_nak_rcvd_count_acc;
+
+
   /// \endcond
 } rsmi_gpu_metrics_t;

but this change of the size of type rsmi_gpu_metrics_t breaks the ABI to 6.0.0. And currently both librocm_smi.so libs have the same interface version 6.

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed? Or that the ABI versioning still needs to be updated to reflect the change? Like 6.0.1, for example?

bertwesarg commented 5 months ago

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed?

Correct.

Like 6.0.1, for example?

This is a particular bad example in this case, as it does not tell the linker/runtime loader, that librocm_smi64.so.6.0.60001 is not compatible to librocm_smi64.so.6.0.60000. They both still have the same SONAME which is librocm_smi64.so.6, and binaries/libraries only record the SONAME in there dependencies (DT_NEEDED)

oliveiradan commented 5 months ago

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed?

Correct.

Like 6.0.1, for example?

This is a particular bad example in this case, as it does not tell the linker/runtime loader, that librocm_smi64.so.6.0.60001 is not compatible to librocm_smi64.so.6.0.60000. They both still have the same SONAME which is librocm_smi64.so.6, and binaries/libraries only record the SONAME in there dependencies (DT_NEEDED)

That's a great point. We expect the rsmi_gpu_metric_t type size to change when new elements are added (which is happening more often due to requirements), but we need to change the SONAME to reflect the change and not break the existing ABI. As it follows the ROCm versioning, let me check on that so we can address it accordingly.

Question: Did it break your application? or are you just saying it should have a different version to identify the change? Thank you for bringing that up!

bertwesarg commented 5 months ago

I'm just reading the code. There are at least two possible ways I know to prevent changing the soname every time a struct changes its size:

  1. Use symbol versions for functions taking the struct as argument
  2. Let the caller pass the size of the struct to the functions

I think its too late for option 2., but 1. should still be able to add, but it wont avoid the increment in this bugfix