Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple networking APIs, including Portals 4, the OpenFabrics Interfaces (OFI), and UCX.

Add initial hwloc support #1108

philipmarshall21 closed this 7 months ago

philipmarshall21 commented 7 months ago

This PR is a subset of the changes in PR #1107 and is intended to isolate the integration of hwloc as an optional dependency of SOS from the addition of multi-NIC functionality.

The changes to the oac_* files reflect the current status of the main branch (at the time of this PR, that is commit c1cfc910d92af43f8c27807a9a84c9c13f4fbc65) as the upstream repo does not have any tags.

The changes to the opal_* files align with the v5.0.1 tag of the upstream repo.

davidozog commented 7 months ago

Are the upstream OPAL changes taken from a release or a particular commit of OMPI? Either way, perhaps we should leave a note on this PR in case it ends up being helpful later.

philipmarshall21 commented 7 months ago

The changes to the oac_* files reflect the current status of the main branch (currently, that is commit c1cfc910d92af43f8c27807a9a84c9c13f4fbc65) as the upstream repo does not have any tags.

The changes to the opal_* files align with the v5.0.1 tag of the upstream repo.

davidozog commented 7 months ago

Side note: The UCX row with pmi-mpi looks slow in testing, but I think we can pretty safely ignore that...

philipmarshall21 commented 7 months ago

Two quick questions @wrrobin @davidozog:

  1. With the addition of hwloc support, should we expect to see the rpath for hwloc when running something like oshcc -show? The rpaths for libfabric and libsma are being set correctly, but the one for hwloc currently is not, so if it should be there I'll need to look into why.
  2. In the changes to init.c that add hwloc API calls, should the application exit if the calls fail? Right now an ERROR message is printed but the application continues running.

wrrobin commented 7 months ago

For 1, I would guess we should see the rpath, considering your changes in the PR. Are you using any other flags that might skip the configury code?

For 2, my thought is that we should not terminate. If one of the hwloc APIs fails or returns an error, we should print a warning that hwloc is not enabled for some reason, but let the application proceed. Do other runtimes behave the same way?

philipmarshall21 commented 7 months ago

Ok, I've made the suggested changes: configure.ac now ensures the rpath for hwloc is set, and failed hwloc API calls now print a warning message rather than an error.
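
For readers following along, the warn-and-continue behavior agreed on above could look roughly like the sketch below. This is not the code from init.c in this PR. The USE_HWLOC macro, the hwloc_enabled flag, and the try_init_hwloc and runtime_init helpers are hypothetical names chosen for illustration. Only hwloc_topology_init, hwloc_topology_load, and hwloc_topology_destroy are real hwloc calls. Without USE_HWLOC defined, the sketch compiles with no hwloc dependency at all, which mirrors hwloc being optional.

```c
#include <stdio.h>

#ifdef USE_HWLOC   /* hypothetical macro; assumed to be set by the build system */
#include <hwloc.h>
static hwloc_topology_t topology;
#endif

/* Nonzero only after a topology has been loaded successfully. */
static int hwloc_enabled = 0;

/* Hypothetical helper: try to set up hwloc; on failure, warn and return
 * instead of aborting, so the application keeps running without hwloc. */
static void try_init_hwloc(void)
{
#ifdef USE_HWLOC
    if (hwloc_topology_init(&topology) != 0) {
        fprintf(stderr, "WARNING: hwloc_topology_init failed; "
                        "continuing without hwloc\n");
        return;
    }
    if (hwloc_topology_load(topology) != 0) {
        fprintf(stderr, "WARNING: hwloc_topology_load failed; "
                        "continuing without hwloc\n");
        hwloc_topology_destroy(topology);  /* release the initialized object */
        return;
    }
    hwloc_enabled = 1;
#endif
}

/* Hypothetical stand-in for the library's init path. Returns 0 on success;
 * a hwloc failure is deliberately not treated as an init failure. */
static int runtime_init(void)
{
    try_init_hwloc();
    /* ... the rest of initialization would go here ... */
    return 0;
}
```

Later code that wants topology information would check hwloc_enabled first and fall back to its default behavior when the flag is 0. A matching finalize path would call hwloc_topology_destroy only when the flag is set.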