chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Should we support `CHPL_INTERCONNECT` / `CHPL_NETWORK`? #25616

Open bradcray opened 1 month ago

bradcray commented 1 month ago

Today, we support a CHPL_TARGET_PLATFORM variable that sometimes tells us a lot about the target platform if it's something specific like an HPE Cray EX or Cray XC system, but sometimes tells us little if it's a Linux cluster. In the latter case, the user has to set CHPL_COMM_* variables to specify how Chapel should map itself to the interconnect, using values like gasnet or ofi. In this issue, I'm wondering whether we should introduce a CHPL_INTERCONNECT or CHPL_NETWORK variable that would support values like none, slingshot, infiniband, ethernet, efa, unset, etc. as a higher-level way to describe the target system, one that's likely more known/knowable to a user than the details of how our communication is implemented. From there, we could then (typically) infer reasonable values for the lower-level CHPL_COMM*-related variables (while still permitting a user to set them explicitly, if desired).

For example, I might imagine that setting CHPL_TARGET_PLATFORM=hpe-apollo would cause CHPL_INTERCONNECT to be inferred to be infiniband, which would then cause CHPL_COMM to be inferred to be gasnet and CHPL_COMM_SUBSTRATE to be inferred to be ibv (and so on). Yet on a Linux cluster that doesn't have a more specific platform identifier than linux64, a user could set CHPL_INTERCONNECT=infiniband and get the same lower-level settings. Or on an Apollo system, the user could override the default and set CHPL_COMM=ofi if they wanted to try the ofi-based implementation.
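The inference chain described above could be sketched roughly as follows. This is only an illustration of the proposal, not existing Chapel code: CHPL_INTERCONNECT is not a supported variable today, and the function name `infer_comm_settings` and both lookup tables are hypothetical. The only mappings included are the ones named in the comment above (hpe-apollo to infiniband, and infiniband to gasnet with the ibv substrate). Dropping the inferred substrate when CHPL_COMM is overridden is also an assumption about how overrides might behave.

```python
# Hypothetical sketch of the proposed inference; CHPL_INTERCONNECT does not
# exist in Chapel today, and these tables hold only the examples from the issue.
PLATFORM_DEFAULT_INTERCONNECT = {
    "hpe-apollo": "infiniband",
}

INTERCONNECT_DEFAULT_COMM = {
    "infiniband": {"CHPL_COMM": "gasnet", "CHPL_COMM_SUBSTRATE": "ibv"},
}


def infer_comm_settings(env):
    """Return inferred settings from a dict of user-set CHPL_* variables."""
    settings = {}

    # Step 1: an explicit CHPL_INTERCONNECT wins; otherwise try the platform.
    interconnect = env.get("CHPL_INTERCONNECT")
    if interconnect is None:
        platform = env.get("CHPL_TARGET_PLATFORM")
        interconnect = PLATFORM_DEFAULT_INTERCONNECT.get(platform)

    # Step 2: map the interconnect to lower-level communication defaults.
    if interconnect is not None:
        settings["CHPL_INTERCONNECT"] = interconnect
        settings.update(INTERCONNECT_DEFAULT_COMM.get(interconnect, {}))

    # Step 3: explicit lower-level settings still override inferred ones.
    # Assumption: a different CHPL_COMM invalidates the inferred substrate.
    explicit_comm = env.get("CHPL_COMM")
    if explicit_comm is not None and explicit_comm != settings.get("CHPL_COMM"):
        settings["CHPL_COMM"] = explicit_comm
        settings.pop("CHPL_COMM_SUBSTRATE", None)
    if "CHPL_COMM_SUBSTRATE" in env:
        settings["CHPL_COMM_SUBSTRATE"] = env["CHPL_COMM_SUBSTRATE"]

    return settings


if __name__ == "__main__":
    print(infer_comm_settings({"CHPL_TARGET_PLATFORM": "hpe-apollo"}))
    print(infer_comm_settings({"CHPL_TARGET_PLATFORM": "linux64",
                               "CHPL_INTERCONNECT": "infiniband"}))
    print(infer_comm_settings({"CHPL_TARGET_PLATFORM": "hpe-apollo",
                               "CHPL_COMM": "ofi"}))
```

The three calls correspond to the three scenarios in the comment: a platform-specific default, a generic Linux cluster with the interconnect set by hand, and an explicit CHPL_COMM override.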

To me, this seems like it would prevent most users from ever having to set CHPL_COMM or its related variables, which feels like a win since that's more about how we implement things than about things a typical user would know, or should need to know.

bhurwitz33 commented 1 month ago

Yes! This definitely resonates with me. I love the idea of introducing a CHPL_INTERCONNECT variable. As Brad says above, this is easy to "know" from a user perspective, because it is easy to look up this info about your HPC system. Plus, if this info can then be used to infer CHPL_COMM* variables that would be great too. To be honest, when I first started, I didn't realize "ibv" stood for Infiniband, and if there was an easier starting point, that would be great!

e-kayrakli commented 1 month ago

The proposal can make building oversubscribed Chapel easy as well, but I can't tell how exactly. In the proposed world, what's the way to build Chapel with oversubscription? CHPL_INTERCONNECT=none && CHPL_COMM=gasnet? A new value for CHPL_INTERCONNECT? A completely new variable?

bradcray commented 1 month ago

@e-kayrakli: Hmm, good question. My first reaction was to require someone wanting oversubscription to use the lower-level variables, thinking they'd somehow be "more expert" and so should deserve the extra work. But thinking about it more, I think that wanting an oversubscribed Chapel for development purposes is pretty common, suggesting it should be similarly friendly. My thought would be to make it a new value like virtual or local, which would result in defaults like CHPL_COMM=gasnet and CHPL_COMM_SUBSTRATE=udp or smp. I feel least excited about making it a new variable; it feels similar to CHPL_GPU=cpu to me, where we also used a special value rather than a new variable.
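As a rough sketch of the "special value" option floated above: the values virtual and local, and the choice between udp and smp, come straight from the comment, but none of this exists in Chapel today, and the helper name `oversubscribed_defaults` and its default of udp are assumptions made purely for illustration.

```python
# Hypothetical sketch: treat oversubscription as a special CHPL_INTERCONNECT
# value (as CHPL_GPU=cpu does for GPUs) rather than as a new variable.
OVERSUBSCRIBED_VALUES = {"virtual", "local"}   # candidate names from the issue
OVERSUBSCRIBED_SUBSTRATES = ("udp", "smp")     # candidate substrates from the issue


def oversubscribed_defaults(interconnect, substrate="udp"):
    """Return lower-level defaults for an oversubscribed build, or None.

    The default substrate of "udp" is an arbitrary assumption; the issue
    leaves the choice between udp and smp open.
    """
    if interconnect not in OVERSUBSCRIBED_VALUES:
        return None
    if substrate not in OVERSUBSCRIBED_SUBSTRATES:
        raise ValueError(f"unexpected substrate: {substrate}")
    return {"CHPL_COMM": "gasnet", "CHPL_COMM_SUBSTRATE": substrate}


if __name__ == "__main__":
    print(oversubscribed_defaults("virtual"))
    print(oversubscribed_defaults("local", substrate="smp"))
    print(oversubscribed_defaults("infiniband"))
```

Under this design, a user would never have to name gasnet at all to simulate multiple locales on one machine.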

e-kayrakli commented 1 month ago

> but thinking about it more, I think that wanting an oversubscribed Chapel for development purposes is pretty common, suggesting it should be similarly friendly.

This strongly resonates with me. I don't view this mode to be a power-user mode. It may be so currently, but this proposal could be an excuse to improve the story there.

mppf commented 1 month ago

I like the way that this idea would allow us to hide implementation details (whether it's using gasnet or ofi). This would also address a point of user feedback where a user requested the ability to simulate multiple locales on a single system without being aware that gasnet exists at all.

bradcray commented 1 month ago

I've taken the liberty of adding "user issue" here due to both Michael's connection to the previous issue and Bonnie's response.