STEllAR-GROUP / phylanx

An Asynchronous Distributed C++ Array Processing Toolkit
Boost Software License 1.0

Connectable #1249

Open stevenrbrandt opened 4 years ago

stevenrbrandt commented 4 years ago

Make it possible to connect new resources to a running Phylanx calculation.

This doesn't actually work. I need guidance.

Problems:

(1) Not sure how to use params with the longer version of hpx::init()

(2) I can only use it to add one locality

./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly

(3) My guess is you don't want sleep() implemented as a math function, though it works fine.

hkaiser commented 4 years ago

Ok, I think the main problem is that the way HPX TCP connections work is not documented adequately. Here are the basics:

Every HPX locality needs to listen to a unique TCP/IP address (IP-address/hostname + port number). Most of the time, HPX can make sure this condition is met (in SLURM or PBS batch environments, for instance), but sometimes you need to help by providing the necessary information.

If the HPX localities run on different nodes this is easily achieved, as every locality uses hostname:7910 as their default, which is unique by definition.

If the localities run on the same node, locality zero uses hostname:7910 (again, by default) but all other localities have to be told to use some other port.

Connecting localities use hostname:7909 as their default, which is why you can have one connecting locality without problems.

To have a second locality or more, you will have to make sure they use a unique port, for instance hostname:7908, etc.
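The uniqueness requirement above can be checked before anything is launched. The following is a minimal sketch, not part of HPX or Phylanx: `ports_are_unique` is a hypothetical helper that only inspects the list of ports you intend to use. It does not test whether a port is actually free on the machine.

```shell
#!/bin/sh
# ports_are_unique PORT...   (hypothetical helper, not an HPX tool)
# Succeeds if no port is listed twice; otherwise prints the duplicates
# and returns a non-zero status.
ports_are_unique() {
    # sort groups equal ports together; uniq -d keeps only repeated ones
    dups=$(printf '%s\n' "$@" | sort | uniq -d)
    if [ -n "$dups" ]; then
        # unquoted $dups joins multiple duplicates onto one line
        echo "duplicate ports: $(echo $dups)"
        return 1
    fi
    return 0
}

# Base locality on 7910, two connecting localities on 7909 and 7908.
ports_are_unique 7910 7909 7908 && echo "ok: all ports distinct"

# Two localities accidentally given the same port.
ports_are_unique 7910 7909 7909 || echo "fix the port list before launching"
```

Running a check like this first is cheaper than debugging a locality that exits at startup because its port is already taken by a sibling.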

HPX has two command-line options to specify the IP addresses to bind their sockets to: --hpx:hpx defines the address a locality will use to listen for incoming parcels, and --hpx:agas defines the address a locality should use to connect to AGAS (usually locality zero).

That said, to run a base locality and two connecting localities on the same node, you could do:

./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908

(note, some of the options could be left out to utilize the built-in defaults, but I have listed the full set to clarify things).

By the way, there is also the command-line option --hpx:connect that can be passed to any locality to instruct it to connect to a running application. IOW, you could do:

./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909 --hpx:connect
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908 --hpx:connect

(note, all three launch the same executable) and it should still work.
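For more than a couple of localities, the command lines above can be generated rather than typed. This is a dry-run sketch only: it prints the commands and does not start anything. The helper names `locality_port` and `print_launch_commands` are hypothetical, and the "7910 minus index" port scheme is just one assumption that reproduces the ports used in the examples above.

```shell
#!/bin/sh
# Dry run: print (do not execute) launch commands for one base locality
# and N connecting localities on the same node, each with its own port.

# locality_port INDEX   (hypothetical helper)
# Index 0 is the base locality (7910); connecting localities count
# down from 7909, matching the ports in the examples above.
locality_port() {
    echo $((7910 - $1))
}

# print_launch_commands EXE HOST N_CONNECTING   (hypothetical helper)
print_launch_commands() {
    exe=$1; host=$2; n=$3
    # AGAS lives on the base locality, so every process points at it.
    agas="$host:$(locality_port 0)"
    echo "$exe --hpx:agas=$agas --hpx:hpx=$agas"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "$exe --hpx:agas=$agas --hpx:hpx=$host:$(locality_port "$i") --hpx:connect"
        i=$((i + 1))
    done
}

print_launch_commands ./base_locality hostname 2
```

With these arguments the sketch prints the same three command lines as listed above. Inspect the output before piping it to a shell or adding `&` to run the localities in the background.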

hkaiser commented 4 years ago

Problems: (1) Not sure how to use params with the longer version of hpx::init()

This is probably what you're looking for: https://hpx-docs.stellar-group.org/latest/html/libs/init_runtime/api.html?highlight=init_params#_CPPv4N3hpx4initEiPPcRK11init_params

(2) I can only use it to add one locality

./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly

See my comment above for some explanations.

(3) My guess is you don't want sleep() implemented as a math function, though it works fine.

Correct, I don't think we should do that. We need a simpler way to add primitives (we have one, but I don't like it ;-), so I'll think about it (see https://github.com/STEllAR-GROUP/phylanx/issues/1250).

stevenrbrandt commented 4 years ago

@hkaiser Note that

./build.Release/bin/physl --hpx:hpx=127.0.0.1:7909 --connect

also exits instantly. Somehow, the HPX arguments are not compatible with the physl arguments. Not sure why.

hkaiser commented 4 years ago

./build.Release/bin/physl --hpx:hpx=127.0.0.1:7909 --connect

I don't know anything about the --connect option. I'm not sure what you mean.

stevenrbrandt commented 4 years ago

@hkaiser the connect option was something I added for the PR, so that I could call finalize instead of disconnect, etc.

stevenrbrandt commented 4 years ago

I now see that the modification to physl wasn't needed, the option --hpx:connect does it.

stevenrbrandt commented 4 years ago

OK, I can connect localities, but I cannot use them. I have a main process that waits for 4 localities, then tries to run a Cannon product (which requires 4 localities) using this script, can.p:

define(
    cannon,
    size,
    block(
        define(
            nl,
            num_localities()
        ),
        while(
            __lt(nl, 4),
            block(
                cout(nl),
                sleep(1),
                store(
                    nl,
                    num_localities()
                )
            )
        ),
        cout("cannon!"),
        define(
            array1,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        define(
            array2,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        cannon_product_d(array1, array2)
    )
)
cannon(120)

Then I have some other processes which just run sleep.p

sleep(10)

I then try to orchestrate things by calling this script, run.sh:

./build.Release/bin/physl --hpx:ini=hpx.parcel.tcp.enable=1 \
    --hpx:threads=2 --hpx:expect-connecting-localities can.p &

sleep 2

echo attach procs
for port in 7913 7911 7912
do
    echo PORT $port
    ./build.Release/bin/physl --hpx:threads=2 --hpx:ini=hpx.parcel.tcp.enable=1 --hpx:hpx=127.0.0.1:$port --hpx:connect sleep.p &
done

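# Block until all background localities have exited. A bare `wait` always
# returns 0, so wrapping it in a `while` loop would never terminate.
wait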
echo "DONE"

The 4 localities are obtained, but when the cannon product is attempted, the code hangs. Thoughts?
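While a hang like this is being investigated, it can help to bound each run so that the orchestration script ends by itself. The following is a sketch under assumptions: `run_with_limit` is a hypothetical wrapper, it relies on the `timeout` utility from GNU coreutils being installed, and `sleep` stands in for the physl command here.

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND under coreutils `timeout`. `timeout` exits with status
# 124 when the time limit was reached, which is reported here.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*"
    fi
    return "$status"
}

# Stand-in for a hanging run: the command wants 5 s but is allowed 1 s.
run_with_limit 1 sleep 5 || echo "exit status: $?"
```

In run.sh the same wrapper could be placed in front of each physl invocation, with a limit comfortably above the expected run time, so that a stuck locality is killed and reported instead of blocking the script.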

stevenrbrandt commented 4 years ago

@hkaiser I also attempted to have all localities run the same code, i.e. can.p. They all print cannon! and then all hang.

hkaiser commented 4 years ago

@hkaiser I also attempted to have all localities run the same code, i.e. can.p. They all print cannon! and then all hang.

That's progress, I guess ;-)