We are running a large KNL-F cluster and recently updated to the latest OFA release, which ships `libpsm2-11.2.156`. Since the update, every MPI that uses `psm2` fails on the system when multiple tasks are used per port. The reason is that we have an inactive HFI, `hfi1_1`, which `psm2` detects as inactive but then still tries to open. The underlying bug appears to be that the active/inactive status is stored in an unsigned array in psm_hal.c. This was (silently) fixed in PSM2 11.2.173. Would it be possible to produce a full release that includes this patch, and to push for it to be included in a full release by Intel? (I understand 11.2.173 is only an "interim" release, so it has probably not gone through full quality assurance.)
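A minimal sketch of how this kind of bug can arise, assuming a negative status value is cached in an unsigned element and later tested with `> 0`. This is not the actual psm_hal.c code: the status values, array names, and function names below are all hypothetical.

```c
#include <stdint.h>

#define MAX_UNITS 2

/* Hypothetical status codes; these are NOT the real PSM2 values. */
enum { UNIT_ERROR = -1, UNIT_INACTIVE = 0, UNIT_ACTIVE = 1 };

/* Buggy variant: the per-unit status cache uses an unsigned element type. */
static uint8_t cache_unsigned[MAX_UNITS];

/* Fixed variant: a signed element type preserves negative status values. */
static int8_t cache_signed[MAX_UNITS];

/* Caches the status, then reports whether the unit looks usable. */
int usable_unsigned(int unit, int status)
{
    cache_unsigned[unit] = (uint8_t)status; /* -1 wraps around to 255 */
    return cache_unsigned[unit] > 0;        /* 255 > 0, so the unit looks active */
}

int usable_signed(int unit, int status)
{
    cache_signed[unit] = (int8_t)status;    /* -1 stays -1 */
    return cache_signed[unit] > 0;          /* -1 > 0 is false, so the unit is skipped */
}
```

With the unsigned cache, a unit whose status is negative is reported as usable, which would match the symptom of an inactive HFI still being opened; with the signed cache, the same unit is skipped.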
For completeness, I am referring to the following changes: