flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

WIP: support JSM ecosystem on coral system #90

Open garlick opened 11 months ago

garlick commented 11 months ago

This is a WIP to collect the fixes needed to get flux-pmix working on the LLNL lassen system, as proposed in #85.

codecov[bot] commented 11 months ago

Codecov Report

Merging #90 (d0b630d) into main (d25a5b4) will not change coverage. The diff coverage is n/a.

:exclamation: Current head d0b630d differs from pull request most recent head 1b50093. Consider uploading reports for the commit 1b50093 to get more accurate results

@@           Coverage Diff           @@
##             main      #90   +/-   ##
=======================================
  Coverage   78.13%   78.13%           
=======================================
  Files          12       12           
  Lines        1413     1413           
=======================================
  Hits         1104     1104           
  Misses        309      309           


garlick commented 11 months ago

I dropped the WIP on the title.

This addresses some problems that have nothing to do with coral.

As for coral status: jsrun booting flux has been demonstrated, but we have yet to find a pmix server package that we can properly link against or, it turns out, a usable hwloc library. Flux may end up being packaged as a module in /usr/tce on this system. At that point we'll see whether any issues remain.

garlick commented 11 months ago

I've got an alternate solution to the shmem debacle working so I'm going to split this PR into parts and resubmit. Sorry for the flailing around!

garlick commented 11 months ago

OK, the flux-pmix tests in the CI build against pmix 3.1.2 are failing the same way as noted in #85, e.g.:

2023-10-04T18:44:39.8275512Z 0.059s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v3: /usr/lib/pmix/mca_bfrops_v3.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
2023-10-04T18:44:39.8276259Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v12: /usr/lib/pmix/mca_bfrops_v12.so: undefined symbol: pmix_buffer_t_class (ignored)
2023-10-04T18:44:39.8276825Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v21: /usr/lib/pmix/mca_bfrops_v21.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
2023-10-04T18:44:39.8277385Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v20: /usr/lib/pmix/mca_bfrops_v20.so: undefined symbol: pmix_bfrops_base_print_datatype (ignored)
2023-10-04T18:44:39.8277764Z 0.060s: flux-shell[0]: stderr: --------------------------------------------------------------------------
2023-10-04T18:44:39.8278110Z 0.060s: flux-shell[0]: stderr: We were unable to find any usable plugins for the BFROPS framework. This PMIx
2023-10-04T18:44:39.8278442Z 0.060s: flux-shell[0]: stderr: framework requires at least one plugin in order to operate. This can be caused
2023-10-04T18:44:39.8278658Z 0.060s: flux-shell[0]: stderr: by any of the following:
2023-10-04T18:44:39.8278805Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8279234Z 0.060s: flux-shell[0]: stderr: * we were unable to build any of the plugins due to some combination
2023-10-04T18:44:39.8279519Z 0.060s: flux-shell[0]: stderr:   of configure directives and available system support
2023-10-04T18:44:39.8279786Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8280100Z 0.060s: flux-shell[0]: stderr: * no plugin was selected due to some combination of MCA parameter
2023-10-04T18:44:39.8280528Z 0.060s: flux-shell[0]: stderr:   directives versus built plugins (i.e., you excluded all the plugins
2023-10-04T18:44:39.8280764Z 0.060s: flux-shell[0]: stderr:   that were built and/or could execute)
2023-10-04T18:44:39.8280914Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8281234Z 0.060s: flux-shell[0]: stderr: * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
2023-10-04T18:44:39.8281549Z 0.060s: flux-shell[0]: stderr:   "mca_base_component_path", is set and doesn't point to any location
2023-10-04T18:44:39.8281957Z 0.060s: flux-shell[0]: stderr:   that includes at least one usable plugin for this framework.
2023-10-04T18:44:39.8282109Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8282380Z 0.060s: flux-shell[0]: stderr: Please check your installation and environment.
2023-10-04T18:44:39.8282645Z 0.060s: flux-shell[0]: stderr: --------------------------------------------------------------------------
2023-10-04T18:44:39.8282869Z 0.059s: flux-shell[0]:  WARN: pmix: PMIx_server_init: SILENT_ERROR
2023-10-04T18:44:39.8283200Z 0.059s: flux-shell[0]: ERROR: plugin 'pmix': shell.init failed
2023-10-04T18:44:39.8283375Z 0.059s: flux-shell[0]: FATAL: shell_init
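
A quick way to see which symbols are missing across the failed components is to group the "unable to open" lines by symbol. The following is only a sketch: the function name `summarize_undefined_symbols` and the regular expression are illustrative assumptions, and the sample text is abbreviated from the log above.

```python
import re
from collections import defaultdict

# Assumed shape of the "unable to open" lines shown in the CI log above.
PATTERN = re.compile(
    r"unable to open (?P<component>\S+): (?P<path>\S+): "
    r"undefined symbol: (?P<symbol>\S+)"
)


def summarize_undefined_symbols(log_text):
    """Map each undefined symbol to the components that failed on it."""
    by_symbol = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            by_symbol[match.group("symbol")].append(match.group("component"))
    return dict(by_symbol)


# Abbreviated copies of the four failing lines (timestamps and host dropped).
SAMPLE = """\
unable to open mca_bfrops_v3: /usr/lib/pmix/mca_bfrops_v3.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
unable to open mca_bfrops_v12: /usr/lib/pmix/mca_bfrops_v12.so: undefined symbol: pmix_buffer_t_class (ignored)
unable to open mca_bfrops_v21: /usr/lib/pmix/mca_bfrops_v21.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
unable to open mca_bfrops_v20: /usr/lib/pmix/mca_bfrops_v20.so: undefined symbol: pmix_bfrops_base_print_datatype (ignored)
"""

if __name__ == "__main__":
    for symbol, components in summarize_undefined_symbols(SAMPLE).items():
        print(symbol, "<-", ", ".join(components))
```

All three missing symbols carry the `pmix_` prefix, which suggests they are expected to come from the core library rather than from another plugin.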

It's hard to piece together what is going on in pmix/ompi land, but it seems like maybe not linking the MCA DSOs against libpmix.so was an oversight in that old version?
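
One way to test that hypothesis is to look at the dynamic section of a plugin and see whether it lists libpmix as a dependency. The sketch below assumes binutils `readelf` is installed and prints its usual English `(NEEDED) Shared library: [...]` lines. The helper names are hypothetical, and the plugin path in the usage note is taken from the log above.

```python
import re
import subprocess

# Assumed format of DT_NEEDED entries in `readelf -d` output.
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(?P<name>[^\]]+)\]")


def parse_needed(readelf_output):
    """Extract the DT_NEEDED library names from `readelf -d` text."""
    return [m.group("name") for m in NEEDED_RE.finditer(readelf_output)]


def needed_libraries(dso_path):
    """Run `readelf -d` on a shared object and return its DT_NEEDED list."""
    result = subprocess.run(
        ["readelf", "-d", dso_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_needed(result.stdout)


def links_against(needed, prefix="libpmix.so"):
    """True if any needed library name starts with the given prefix."""
    return any(name.startswith(prefix) for name in needed)
```

For example, `links_against(needed_libraries("/usr/lib/pmix/mca_bfrops_v3.so"))` returning `False` would be consistent with the plugin relying on the host process to have already loaded libpmix with globally visible symbols.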

This PR describes the problem as one with static builds (not applicable here): https://github.com/openpmix/openpmix/pull/1188. However, the proposed fix was also confirmed to fix a non-static build of 3.1.2: https://github.com/openpmix/openpmix/issues/1186

So maybe the installed 3.1.2 on lassen is just unusable?

garlick commented 11 months ago

Well, 3.1.2 may not even be in use on coral; it just happens to be the newest packaged version that includes the server headers. The version that jsm is built with is 3.1.4, so pushing out a new package for 3.1.4 may be the logical thing to do there. I'll try changing flux-pmix's minimum pmix version and the CI build to 3.1.4 and see how that goes.

garlick commented 11 months ago

Lots of failures in the flux-pmix unit tests with 3.1.4. Sigh. I dropped the CI commit for now and will revisit this branch later. Adding back the WIP.