chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Design environment variable/flag to control multilocale library connection style #13425

Closed lydia-duncan closed 5 years ago

lydia-duncan commented 5 years ago

Using the node's hostname when connecting the ZMQ sockets under the covers is necessary on supercomputers and clusters, but is only sometimes successful on desktops and laptops. In those cases the connection stays within the same machine (we don't spawn the server on any other machine unless a launcher is provided), so we should use either localhost or 0.0.0.0. The latter is more likely to "just work" without causing unanticipated problems for the user, but it may expose the connection more widely.
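To illustrate the trade-off between the two candidate addresses, here is a minimal sketch that uses Python's standard `socket` module rather than ZMQ (an assumption made only to keep the example self-contained; the helper names `bind_listener` and `bound_address` are hypothetical). A listener bound to 127.0.0.1 only accepts connections from the same machine, while one bound to 0.0.0.0 listens on every interface.

```python
import socket


def bind_listener(host: str) -> socket.socket:
    """Bind a TCP listener to `host` on an OS-chosen port (port 0)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen(1)
    return sock


def bound_address(host: str) -> str:
    """Return the address the listener actually ended up bound to."""
    with bind_listener(host) as sock:
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Loopback keeps the connection on this machine; 0.0.0.0 is the
    # wildcard address, reachable through any configured interface.
    for host in ("127.0.0.1", "0.0.0.0"):
        print(host, "->", bound_address(host))
```

Binding to the node's hostname would instead depend on how that name resolves on the machine in question, which is consistent with it working only some of the time on desktops and laptops.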

Someone should correct me if I am wrong, but I don't think we can easily detect when we are on a cluster, so having the compiler/runtime compute this information for itself may be impossible. With that in mind, we should choose a reasonable default and then allow the user to explicitly override that default.

My proposal:

lydia-duncan commented 5 years ago

@gbtitus, @mppf and @ronawho - I would be especially curious to hear your thoughts on this

lydia-duncan commented 5 years ago

(also, feel free to correct any errors in this, I might have mischaracterized something)

gbtitus commented 5 years ago

Is it possible that someday we might want to do this same sort of thing between a client running on one node of an IB cluster and a server running on a number of other nodes of that cluster? Or even, between a client running on a laptop and a server running on a multi-node AWS instance? The latter might be a stretch at this point, but the former could happen today, for example if someone wanted to use a lighter weight server backend during development. If so, instead of differentiating based on Cray versus not-Cray would it make sense to differentiate based on using a launcher versus not using a launcher?

Separately, might it make sense to add a CHPL_RT_LIB_ML_HOST runtime environment variable, to allow users to specify a connection host? Then we could default to 0.0.0.0 (with the associated connection exposure), but users could direct us to use something else if that was a concern for them. It might even let them do things like ssh-tunneling the connection.
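A rough sketch of how such an override could behave, written in Python purely for illustration. CHPL_RT_LIB_ML_HOST is only the name proposed in this comment, not a confirmed runtime variable, and the helpers `connection_host` and `endpoint` are hypothetical. The `tcp://host:port` string follows the usual ZMQ endpoint form.

```python
import os

# Default suggested in the comment above; exposes the connection on
# all interfaces unless the user overrides it.
DEFAULT_HOST = "0.0.0.0"


def connection_host(env=None) -> str:
    """Pick the connection host, preferring the proposed override variable."""
    env = os.environ if env is None else env
    value = env.get("CHPL_RT_LIB_ML_HOST", "").strip()
    # An unset or empty variable falls back to the default.
    return value or DEFAULT_HOST


def endpoint(port: int, env=None) -> str:
    """Build a ZMQ-style TCP endpoint string for the chosen host."""
    return f"tcp://{connection_host(env)}:{port}"


if __name__ == "__main__":
    print(endpoint(5555, {}))
    print(endpoint(5555, {"CHPL_RT_LIB_ML_HOST": "127.0.0.1"}))
```

A user concerned about exposure could set the variable to 127.0.0.1 and reach the port through an ssh tunnel, which is the scenario mentioned above.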

mppf commented 5 years ago

It just seems the same as the need for export GASNET_MASTERIP=127.0.0.1. When a system has multiple interfaces, it doesn't always work right to listen on a port without specifying which interface to use.

I'm not sure how to differentiate the different cases but I agree with Greg that Cray vs Non-Cray is probably not the right way.

lydia-duncan commented 5 years ago

I believe we always have CHPL_LAUNCHER set to something other than none when CHPL_COMM != none, unless the user has explicitly set it otherwise. So unless there are certain launcher settings that only come up when on a system that should just talk to itself (which there may be, I'm not as familiar with these), I don't think we'll be able to determine when to use which style.

gbtitus commented 5 years ago

> I believe we always have CHPL_LAUNCHER set to something other than none when CHPL_COMM != none, ...

We should have it set whenever we need to use a launcher. For example, we always have CHPL_LAUNCHER set on Cray XC systems if we want to run on the compute nodes, even with CHPL_COMM=none. It also should be possible to set CHPL_LAUNCHER=slurm-srun with CHPL_COMM=none on a vanilla linux64 cluster with a slurm WLM and thus have single-node Chapel programs launched into slurm jobs automagically. Or at least, if that doesn't work today it shouldn't be hard to get there.

lydia-duncan commented 5 years ago

I would be okay with having a set of values for CHPL_LAUNCHER that would trigger hostname use, while the rest default to the other setting, but I would need some help determining which ones those are.

I know at least amudprun can fail me in some cases and work in others, so I would put that on the "not" list.
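The idea of a launcher-based allow list might look something like the following Python sketch. The membership of the set is an assumption for illustration only: the thread establishes just that amudprun belongs on the "not" list, and slurm-srun is included here as a hypothetical example of a cluster launcher. The names `HOSTNAME_LAUNCHERS` and `connection_style` are also hypothetical.

```python
# Launchers that would trigger hostname-based connections.
# Hypothetical membership: the thread does not settle this list.
HOSTNAME_LAUNCHERS = frozenset({"slurm-srun"})


def connection_style(launcher: str) -> str:
    """Return "hostname" for listed launchers, "local" for everything else."""
    return "hostname" if launcher in HOSTNAME_LAUNCHERS else "local"


if __name__ == "__main__":
    for name in ("slurm-srun", "amudprun", "none"):
        print(name, "->", connection_style(name))
```

Defaulting unlisted launchers to the local style matches the suggestion that only a known set of values should trigger hostname use.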

mppf commented 5 years ago

Supposing the hostname is usually what we want, would just going off of GASNET_MASTERIP work to find the machine-local (testing) configurations? (Don't you have to set that anyway?)
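One possible reading of "going off of GASNET_MASTERIP", sketched in Python: treat the configuration as machine-local when the variable is set to a loopback address. This interpretation is an assumption, as is the hypothetical helper name `looks_machine_local`; the thread does not say whether the check would be on the variable's presence or on its value.

```python
import ipaddress


def looks_machine_local(env) -> bool:
    """Guess a machine-local setup from GASNET_MASTERIP (assumed heuristic)."""
    value = env.get("GASNET_MASTERIP")
    if not value:
        return False
    try:
        # Loopback covers 127.0.0.0/8 and ::1.
        return ipaddress.ip_address(value).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname); this sketch does not resolve it.
        return False


if __name__ == "__main__":
    print(looks_machine_local({"GASNET_MASTERIP": "127.0.0.1"}))
    print(looks_machine_local({}))
```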

gbtitus commented 5 years ago

> ... would just going off of GASNET_MASTERIP work ...

I only set that on my MacBook (OS X), I believe to avoid problems when VPN'ed. I vaguely recall needing to set it on some other system(s) in the distant past, but don't recall the details. I don't currently set it other than on my MacBook.

lydia-duncan commented 5 years ago

I do have to set that on the machine where I had problems. My only argument against using it is that we will likely move to a world where CHPL_LAUNCHER triggers the "multilocale" build without the user needing CHPL_COMM=gasnet, and GASNET_MASTERIP is something GASNet uses itself (rather than something we added, right?).

lydia-duncan commented 5 years ago

It does seem likely that GASNET_MASTERIP entirely overlaps with the cases we would be worried about, though.

mppf commented 5 years ago

Yeah, we didn't add it.

> we will likely move to a world where CHPL_LAUNCHER would trigger the "multilocale" build

If that's far in the future, could we use GASNET_MASTERIP for the current time?

Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.
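A minimal Python sketch of that forwarding idea, under stated assumptions: CHPL_LAUNCH_MASTERIP is only the example name floated here, the helper `apply_launch_masterip` is hypothetical, and the choice to leave an explicitly set GASNET_MASTERIP untouched is a design guess rather than anything the thread decides.

```python
def apply_launch_masterip(env: dict) -> dict:
    """Return a copy of `env` with GASNET_MASTERIP derived from CHPL_LAUNCH_MASTERIP."""
    result = dict(env)  # never mutate the caller's environment
    master_ip = result.get("CHPL_LAUNCH_MASTERIP")
    # Assumption: an explicit GASNET_MASTERIP wins over the Chapel-level variable.
    if master_ip and "GASNET_MASTERIP" not in result:
        result["GASNET_MASTERIP"] = master_ip
    return result


if __name__ == "__main__":
    print(apply_launch_masterip({"CHPL_LAUNCH_MASTERIP": "127.0.0.1"}))
```

Keeping the Chapel-level variable as the user-facing knob means the same setting could later drive non-GASNet configurations as well.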

lydia-duncan commented 5 years ago

> Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.

I'd be down for that. It also seems like that would resolve Greg's desire for the host environment variable.

bradcray commented 5 years ago

I'm not following the technical side of this conversation very well, but from the design side, I like Michael's suggestion:

> Another option would be to make our own variable like GASNET_MASTERIP, e.g. CHPL_LAUNCH_MASTERIP and set GASNET_MASTERIP based upon it.

and it feels resonant with other things that we try not to make terribly GASNet-specific while still being GASNet-compatible. That said, I think @ronawho and @gbtitus know the runtime variables the best, so would defer to them if they disagreed.

lydia-duncan commented 5 years ago

Resolved by #13543