cornelisnetworks / opa-psm2

Other
36 stars 29 forks source link

Please create a full release with psm_hal.h/c changes from 11.2.173 #53

Closed krausedfzj closed 4 years ago

krausedfzj commented 4 years ago

We are running a large KNL-F cluster and recently updated to the latest OFA release with libpsm2-11.2.156. However, all MPIs using psm2 fail on the system when multiple tasks are used per port. The reason is that we have an inactive HFI hfi1_1, which psm2 detects but then still trys to open. The underlying bug appears to be caused by the fact that the active/inactive status is stored in psm_hal.c in an unsigned array. This was (silently) fixed in PSM2 2.11.173. Would it be possible to produce a full release (I understand 11.2.173 is only an "interim" release, so probably not gone through full quality assurance?) that includes this patch and also push this for a full release by Intel?

For completeness, I am refering to the following changes:

diff --git a/psm2_hal.c b/psm2_hal.c
index 6ba281a..b4b9d9a 100644
--- a/psm2_hal.c
+++ b/psm2_hal.c
@@ -326,14 +326,10 @@ static struct _psmi_hal_instance *psmi_hal_get_pi_inst(void)
                                p->params.num_ports = nports;
                                p->params.default_pkey = dflt_pkey;
                                p->params.sw_status |= valid_flags;
-                               p->params.unit_active = (uint8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits,
-                                                                               sizeof(uint8_t));
-                               p->params.unit_active_valid = (uint8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits,
-                                                                                     sizeof(uint8_t));
-                               p->params.port_active = (uint8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits*nports,
-                                                                               sizeof(uint8_t));
-                               p->params.port_active_valid = (uint8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits*nports,
-                                                                                     sizeof(uint8_t));
+                               p->params.unit_active = (int8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits, sizeof(int8_t));
+                               p->params.unit_active_valid = (int8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits, sizeof(int8_t));
+                               p->params.port_active = (int8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits*nports, sizeof(int8_t));
+                               p->params.port_active_valid = (int8_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits*nports, sizeof(int8_t));
                                p->params.num_contexts = (uint16_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits,
                                                                                  sizeof(uint16_t));
                                p->params.num_contexts_valid = (uint16_t *) psmi_calloc(PSMI_EP_NONE, UNDEFINED, nunits,
diff --git a/psm2_hal.h b/psm2_hal.h
index 02d62a2..1bec596 100644
--- a/psm2_hal.h
+++ b/psm2_hal.h
@@ -183,8 +183,8 @@ typedef struct _psmi_hal_params
        uint16_t   num_units;
        uint16_t   num_ports;
        uint16_t   default_pkey;
-       uint8_t    *unit_active,*unit_active_valid;
-       uint8_t    *port_active,*port_active_valid;
+       int8_t     *unit_active,*unit_active_valid;
+       int8_t     *port_active,*port_active_valid;
        uint16_t   *num_contexts,*num_contexts_valid;
        uint16_t   *num_free_contexts,*num_free_contexts_valid;
 } psmi_hal_params_t;
krausedfzj commented 4 years ago

Correction: it should have read "... multiple tasks per node."

BrendanCunningham commented 4 years ago

This fix is in IFS 10.10.2.21.

Source code for this release has been upstreamed as PSM2 11.2.166.