cornelisnetworks / opa-psm2

Allow PSM2 to work with "self,shm" devices without omnipath hardware #59

Closed chuckcranor closed 2 years ago

chuckcranor commented 3 years ago

Both PSM and PSM2 allow you to specify the devices you want to configure (i.e. the PTLs) using the PSM_DEVICES and PSM2_DEVICES environment variables, respectively. For example, setting the variable to "self,shm" tells the library to configure the self and shared memory PTL layers, but not the IPS layer, which requires PSM/PSM2 InfiniBand HCAs. This allows users to do some basic PSM/PSM2 CI testing (e.g. with Travis) without needing InfiniBand hardware in the CI testing infrastructure.
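As a rough illustration of the CI use case, a test harness could request the software-only devices from inside the program instead of relying on the shell. The sketch below is only an assumption about how such a harness might look: the helper name request_software_only_devices() is hypothetical and not part of PSM2, and it relies solely on the POSIX setenv() call. It would have to run before psm2_init().

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict C modes */
#include <stdlib.h>

/*
 * Hypothetical helper (not part of PSM2): ask for the "self,shm" devices
 * unless the caller's environment already set PSM2_DEVICES.
 * Returns 0 on success, -1 on failure (same as setenv()).
 */
int request_software_only_devices(void)
{
    /* overwrite = 0 keeps any PSM2_DEVICES value that is already set */
    return setenv("PSM2_DEVICES", "self,shm", 0);
}
```

Passing 0 as the last argument means a developer who exports PSM2_DEVICES explicitly (for example to test on real hardware) still gets the value they asked for.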

Unfortunately, this feature is currently broken in PSM2 due to the introduction of the HAL layer. psm2_init() always calls psmi_hal_initialize(), which fails if InfiniBand hardware is not present. As a result, psm2_init() fails with a "PSM Unresolved internal error" even if PSM2_DEVICES is set to "self,shm" ...

This patch resolves this issue and allows PSM2 to operate with PSM2_DEVICES set to "self,shm" ...

To do this, we have psm2_init() examine PSM2_DEVICES and only call psmi_hal_initialize() if the PTL_DEVID_IPS device is in use. If psm2_init() determines that PTL_DEVID_IPS is not needed, then we install a "null" hal into "psmi_hal_current_hal_instance" using a new hal API call psmi_hal_initialize_null().

We make psm_ep.c's psmi_parse_devices() and psmi_device_is_enabled() non-static so that psm2_init() can call them from psm.c.
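The decision described above can be sketched in isolation as follows. This is a simplified stand-in, not the code from the patch: device_listed(), ips_needed(), choose_hal() and the hal_choice enum are hypothetical names, the hardware device token is passed in as a parameter because its exact spelling is not stated here, and treating an unset PSM2_DEVICES as "hardware needed" is an assumption about the default.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the device-list parsing: returns true if the
 * comma-separated list `devices` contains exactly the token `name`. */
bool device_listed(const char *devices, const char *name)
{
    size_t n = strlen(name);
    const char *p = devices;

    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current token */
        if (len == n && strncmp(p, name, n) == 0)
            return true;
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return false;
}

/* Assumption: an unset or empty device list means the default set,
 * which includes the hardware (IPS) device. */
bool ips_needed(const char *devices, const char *hw_token)
{
    if (devices == NULL || *devices == '\0')
        return true;
    return device_listed(devices, hw_token);
}

/* Which HAL the init path would install in this sketch. */
enum hal_choice { HAL_HARDWARE, HAL_NULL };

enum hal_choice choose_hal(const char *devices, const char *hw_token)
{
    /* Only probe for hardware when the IPS device was requested;
     * otherwise fall back to a "null" HAL. */
    return ips_needed(devices, hw_token) ? HAL_HARDWARE : HAL_NULL;
}
```

In this sketch, a call such as choose_hal(getenv("PSM2_DEVICES"), "hfi") would select the null HAL for "self,shm"; the token "hfi" is used here only as an example value.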

With the fix applied, on a node without PSM2 hardware, the following test program:

#include <stdio.h>
#include <stdlib.h>
#include <psm2.h>

int main(int argc, char **argv) {
  int ver_major, ver_minor;
  psm2_error_t err;

  /* pass in the API version we were compiled against; psm2_init()
     writes back the version of the loaded library */
  ver_major = PSM2_VERNO_MAJOR;
  ver_minor = PSM2_VERNO_MINOR;
  err = psm2_init(&ver_major, &ver_minor);
  if (err == PSM2_OK) {
    printf("psm2_init: OK, version is %d.%d\n", ver_major, ver_minor);
  } else {
    printf("psm2_init: ERR %s\n", psm2_error_get_string(err));
  }
  exit(0);
}

results in:

% ./p-test
psm2_init: ERR PSM Unresolved internal error
% env PSM2_DEVICES=self,shm ./p-test
psm2_init: OK, version is 2.2
%

Other PSM2 calls (e.g. irecv, isend, ipeek, test, ...) also work over shm with this change.

mwheinz commented 3 years ago

This is too late in the current development cycle for the next release, but I'll take a look at it for the release afterwards. I'm honestly not sure how generally useful it will be but if it doesn't hurt I don't see a problem with doing it.

chuckcranor commented 2 years ago

"Looks good. Just add Signed-off-by line to commit message, please."

Done.