Closed lydia-duncan closed 5 years ago
@gbtitus, @mppf and @ronawho - I would be especially curious to hear your thoughts on this
(also, feel free to correct any errors in this, I might have mischaracterized something)
Is it possible that someday we might want to do this same sort of thing between a client running on one node of an IB cluster and a server running on a number of other nodes of that cluster? Or even, between a client running on a laptop and a server running on a multi-node AWS instance? The latter might be a stretch at this point, but the former could happen today, for example if someone wanted to use a lighter weight server backend during development. If so, instead of differentiating based on Cray versus not-Cray would it make sense to differentiate based on using a launcher versus not using a launcher?
Separately, might it make sense to add a CHPL_RT_LIB_ML_HOST runtime environment variable, to allow users to specify a connection host? Then we could default to 0.0.0.0 (with the associated connection exposure) but they could direct us to use something else if that was a concern for them. It might even let them do things like ssh-tunneling the connection.
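A minimal sketch of how such an override could resolve, assuming the variable is simply read from the environment and falls back to 0.0.0.0 when unset or empty. The helper names, the tcp:// endpoint format, and the port are illustrative assumptions, not existing Chapel runtime code.

```python
import os

# Default suggested in the comment above; exposes the connection on all interfaces.
DEFAULT_HOST = "0.0.0.0"


def connection_host(env=None):
    """Return the host to connect on, honoring CHPL_RT_LIB_ML_HOST if set.

    `env` is any mapping of environment variables; it defaults to os.environ.
    (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    host = env.get("CHPL_RT_LIB_ML_HOST", "").strip()
    return host or DEFAULT_HOST


def endpoint(port, env=None):
    """Build a ZMQ-style TCP endpoint string for the chosen host."""
    return f"tcp://{connection_host(env)}:{port}"


if __name__ == "__main__":
    # Port 5555 is an arbitrary example value.
    print(endpoint(5555))
```

With this shape, a user worried about exposure could set CHPL_RT_LIB_ML_HOST=127.0.0.1, and someone tunneling over ssh could point it at whatever address the tunnel listens on.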
It just seems the same as the need for export GASNET_MASTERIP=127.0.0.1. When a system has multiple interfaces, it doesn't always work right to listen on a port without specifying which interface to use.
I'm not sure how to differentiate the different cases but I agree with Greg that Cray vs Non-Cray is probably not the right way.
I believe we always have CHPL_LAUNCHER set to something other than none when CHPL_COMM != none, unless the user has explicitly set it otherwise. So unless there are certain launcher settings that only come up when on a system that should just talk to itself (which there may be, I'm not as familiar with these), I don't think we'll be able to determine when to use which style.
I believe we always have CHPL_LAUNCHER set to something other than none when CHPL_COMM != none, ...
We should have it set whenever we need to use a launcher. For example, we always have CHPL_LAUNCHER set on Cray XC systems if we want to run on the compute nodes, even with CHPL_COMM=none. It also should be possible to set CHPL_LAUNCHER=slurm-srun with CHPL_COMM=none on a vanilla linux64 cluster with a slurm WLM and thus have single-node Chapel programs launched into slurm jobs automagically. Or at least, if that doesn't work today it shouldn't be hard to get there.
I would be okay with having a set of values for CHPL_LAUNCHER that would trigger hostname use, while the rest default to the other setting, but I would need some help determining which ones those are.
I know at least amudprun can fail me in some cases and work in others, so I would put that on the "not" list.
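A rough sketch of what a launcher-keyed choice could look like. Which launchers belong in the hostname set is exactly the open question here, so the set below is a placeholder assumption: slurm-srun is included only because it is mentioned earlier in the thread, and amudprun is left out per the "not" list. The function name is hypothetical.

```python
# Placeholder assumption: launchers for which the node's hostname should be used.
# The real membership of this set is undecided in the discussion.
HOSTNAME_LAUNCHERS = frozenset({"slurm-srun"})


def use_hostname(chpl_launcher):
    """Return True if this CHPL_LAUNCHER value should trigger hostname use.

    Any launcher not listed (including "none" and "amudprun") falls back
    to the other, machine-local setting.
    """
    return chpl_launcher in HOSTNAME_LAUNCHERS


if __name__ == "__main__":
    for launcher in ("slurm-srun", "amudprun", "none"):
        style = "hostname" if use_hostname(launcher) else "local"
        print(f"{launcher}: {style}")
```

An allowlist keeps unknown or newly added launchers on the machine-local default until someone decides otherwise.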
Supposing the hostname is usually what we want, would just going off of GASNET_MASTERIP work to find the machine-local (testing) configurations? (Don't you have to set that anyway?)
... would just going off of GASNET_MASTERIP work ...
I only set that on my MacBook (OS X), I believe to avoid problems when VPN'ed. I vaguely recall needing to set it on some other system(s) in the distant past, but don't recall the details. I don't currently set it other than on my MacBook.
I do have to have that set on the machine where I had problems. I think my only argument against that is that we will likely move to a world where CHPL_LAUNCHER would trigger the "multilocale" build but that the user wouldn't need CHPL_COMM=gasnet, and GASNET_MASTERIP is something gasnet uses itself (rather than something we added, right?)
It does seem likely that GASNET_MASTERIP entirely overlaps with the cases we would be worried about, though.
Yeah, we didn't add it.
we will likely move to a world where CHPL_LAUNCHER would trigger the "multilocale" build
If that's far in the future, could we use GASNET_MASTERIP for the current time?
Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.
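As a sketch of that option, assuming CHPL_LAUNCH_MASTERIP (the name floated above, not an existing variable) takes precedence over any GASNET_MASTERIP already in the environment. That precedence, and the helper name, are assumptions made for illustration.

```python
def propagate_master_ip(env):
    """Return a copy of `env` with GASNET_MASTERIP derived from CHPL_LAUNCH_MASTERIP.

    Assumption: when CHPL_LAUNCH_MASTERIP is set and non-empty it overrides
    any existing GASNET_MASTERIP; otherwise the environment is left unchanged.
    """
    new_env = dict(env)
    master_ip = new_env.get("CHPL_LAUNCH_MASTERIP")
    if master_ip:
        new_env["GASNET_MASTERIP"] = master_ip
    return new_env


if __name__ == "__main__":
    import os

    child_env = propagate_master_ip(os.environ)
    print(child_env.get("GASNET_MASTERIP", "<unset>"))
```

Returning a copy rather than mutating the input means the derived environment can be handed to a spawned process without touching the caller's own environment.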
Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.
I'd be down for that. It also seems like that would resolve Greg's desire for the host environment variable.
I'm not following the technical side of this conversation very well, but from the design side, I like Michael's suggestion:
Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.
and it feels resonant with other things that we try not to make terribly GASNet-specific while still being GASNet-compatible. That said, I think @ronawho and @gbtitus know the runtime variables the best, so would defer to them if they disagreed.
Resolved by #13543
Using the node's hostname when connecting the ZMQ sockets under the covers is necessary on supercomputers and clusters, but is only sometimes successful when used from desktops and laptops. In those cases, the connection will remain within the same machine (since we don't spawn the server to any other machine unless there's a launcher provided), so we should use either localhost or 0.0.0.0 (the latter being more likely to "just work" and not cause problems for the user that we don't anticipate, but may more widely expose the connection).
Someone should correct me if I am wrong, but I think we can't as easily detect when we are on a cluster, so having the compiler/runtime compute this information for itself may be impossible. With that in mind, we should choose a reasonable default and then allow the user to explicitly override that default.
My proposal:
- 0.0.0.0 everywhere else
- --library-ml-connection and environment variable CHPL_LIB_ML_CONN, with values of L and H (for "local" and "host")
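A minimal sketch of how the two proposed values might map to an address, assuming L selects 0.0.0.0 and H selects the node's hostname. The function name, the case-insensitive matching, and the error handling are assumptions for illustration, not the implementation that resolved this issue.

```python
import socket


def connection_address(mode, hostname=None):
    """Map a CHPL_LIB_ML_CONN-style value to the address to connect on.

    "L" (local) -> 0.0.0.0, "H" (host) -> the node's hostname.
    `hostname` may be passed explicitly; otherwise socket.gethostname() is used.
    """
    mode = mode.strip().upper()
    if mode == "L":
        return "0.0.0.0"
    if mode == "H":
        return hostname if hostname is not None else socket.gethostname()
    raise ValueError(f"expected 'L' or 'H', got {mode!r}")


if __name__ == "__main__":
    print(connection_address("L"))
    print(connection_address("H"))
```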