Open stevenrbrandt opened 4 years ago
Ok, I think the main problem is that the way HPX TCP connections work is not documented adequately. Here are the basics:
Every HPX locality needs to listen on a unique TCP/IP address (IP address or hostname, plus port number). Most of the time, HPX can make sure this condition is met (in SLURM or PBS batch environments, for instance), but sometimes you need to help by providing the necessary information.
If the HPX localities run on different nodes this is easily achieved, as every locality uses hostname:7910 as their default, which is unique by definition.
If the localities run on the same node, locality zero uses hostname:7910 (again, by default) but all other localities have to be told to use some other port.
Connecting localities use hostname:7909 as their default, which is why you can have one connecting locality without problems.
To have a second locality or more, you will have to make sure they use a unique port, for instance hostname:7908, etc.
HPX has two command-line options to specify the IP addresses to bind their sockets to: --hpx:hpx defines the address a locality will use to listen for incoming parcels, and --hpx:agas defines the address a locality should use to connect to AGAS (usually locality zero).
That said, to run a base locality and two connecting localities on the same node, you could do:
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908
(note, some of the options could be left out to utilize the built-in defaults, but I have listed the full set to clarify things).
By the way, there is also the command-line option --hpx:connect that can be passed to any locality to instruct it to connect to a running application. IOW, you could do:
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909 --hpx:connect
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908 --hpx:connect
(note, all three launch the same executable) and it should still work.
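To make the port scheme above concrete, here is a minimal dry-run sketch that only prints the command lines for a single-node run. It assumes ports count down from 7910 (7910 for locality zero, then 7909, 7908, ...), as in the examples above. The helper name locality_cmd and its arguments are hypothetical and are not part of HPX; only the --hpx:agas, --hpx:hpx and --hpx:connect options come from this thread.

```shell
#!/bin/sh
# Print the launch command for locality number `index` on one node.
# Assumption: the AGAS endpoint is host:7910 and each further locality
# listens one port lower (7909, 7908, ...), as described in the text.
# `locality_cmd` is a hypothetical helper used for illustration only.
locality_cmd() {
    exe=$1
    host=$2
    index=$3
    agas_port=7910
    port=$((agas_port - index))
    cmd="$exe --hpx:agas=$host:$agas_port --hpx:hpx=$host:$port"
    # Every locality except locality zero connects to the running application.
    if [ "$index" -gt 0 ]; then
        cmd="$cmd --hpx:connect"
    fi
    printf '%s\n' "$cmd"
}

# Dry run: one base locality plus two connecting localities.
i=0
while [ "$i" -lt 3 ]; do
    locality_cmd ./base_locality localhost "$i"
    i=$((i + 1))
done
```

Nothing is launched here; the output can be inspected first and then run by hand, which makes it easier to spot two localities that would otherwise end up with the same port.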
Problems: (1) Not sure how to use params with the longer version of hpx::init()
This is probably what you're looking for: https://hpx-docs.stellar-group.org/latest/html/libs/init_runtime/api.html?highlight=init_params#_CPPv4N3hpx4initEiPPcRK11init_params
(2) I can only use it to add one locality
./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly
See my comment above for some explanations.
(3) My guess is you don't want sleep() implemented as a math function, though it works fine.
Correct, I don't think we should do that. We need a simpler way to add primitives (we have one, but I don't like it ;-), so I'll think about it (see https://github.com/STEllAR-GROUP/phylanx/issues/1250).
@hkaiser Note that
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7909 --connect
also exits instantly. Somehow, the hpx arguments are not compatible with the physl arguments. Not sure why.
I don't know anything about the --connect option. I'm not sure what you mean.
@hkaiser the connect option was something I added for the PR, so that I could call finalize instead of disconnect, etc.
I now see that the modification to physl wasn't needed; the option --hpx:connect does it.
OK, I can connect localities, but I cannot use them. So I have a main process which waits for 4 localities, then tries to run a cannon product (which requires 4 localities) using this script, can.p:
define(
    cannon,
    size,
    block(
        define(
            nl,
            num_localities()
        ),
        while(
            __lt(nl, 4),
            block(
                cout(nl),
                sleep(1),
                store(
                    nl,
                    num_localities()
                )
            )
        ),
        cout("cannon!"),
        define(
            array1,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        define(
            array2,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        cannon_product_d(array1, array2)
    )
)
cannon(120)
Then I have some other processes which just run sleep.p:
sleep(10)
I then try to orchestrate things by calling this script, run.sh:
./build.Release/bin/physl --hpx:ini=hpx.parcel.tcp.enable=1 \
    --hpx:threads=2 --hpx:expect-connecting-localities can.p &
sleep 2
echo attach procs
for port in 7913 7911 7912
do
    echo PORT $port
    ./build.Release/bin/physl --hpx:threads=2 --hpx:ini=hpx.parcel.tcp.enable=1 --hpx:hpx=127.0.0.1:$port --hpx:connect sleep.p &
done
# A bare `wait` blocks until every background job has exited. It always
# returns 0, so looping on it (`while wait; do ...`) would never terminate.
wait
echo "DONE"
The 4 localities are obtained, but when the cannon product is attempted, the code hangs. Thoughts?
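Since some of the invocations in this thread exit instantly, one possible variation on run.sh is to wait on each background job separately and print its exit status, so a locality that dies right away is easy to tell apart from one that hangs. This is only a sketch: `sh -c 'exit N'` stands in for the real physl command lines, and the helper names launch and report are hypothetical.

```shell
#!/bin/sh
# Start jobs in the background, remember their PIDs, then wait on each PID
# individually so every job's exit status is reported on its own line.
# Assumption: `sh -c 'exit N'` is a placeholder for the real physl commands.
pids=""

# Hypothetical helper: run the given command in the background, record its PID.
launch() {
    "$@" &
    pids="$pids $!"
}

# Hypothetical helper: wait for every recorded PID and count the failures.
report() {
    failed=0
    for pid in $pids; do
        if wait "$pid"; then
            echo "pid $pid: ok"
        else
            status=$?
            echo "pid $pid: exit status $status"
            failed=$((failed + 1))
        fi
    done
    echo "failed: $failed"
}

launch sh -c 'exit 0'
launch sh -c 'exit 3'
report
```

Note that report has to be called in the same shell that launched the jobs (not inside `$(...)`), because a subshell cannot wait on its parent's children.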
@hkaiser I also attempted to have all localities run the same code, i.e. can.p. They all print cannon! and then all hang.
That's progress, I guess ;-)
Make it possible to connect new resources to a running Phylanx calculation.
This doesn't actually work. I need guidance.
Problems:
(1) Not sure how to use params with the longer version of hpx::init()
(2) I can only use it to add one locality
./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly
(3) My guess is you don't want sleep() implemented as a math function, though it works fine.