STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Support for Windows HPC Pack #5903

Open Neumann-A opened 2 years ago

Neumann-A commented 2 years ago

Any plans to support HPC Pack on Windows?

I already extracted the following env variables:

//CCP_TASKID=2
//CCP_TASKCONTEXT=15317.26014
//CCP_CLUSTER_NAME=HEAD
//CCP_JOBID=15317
//CCP_DATA=C:\Program Files\Microsoft HPC Pack 2016\Data\
//CCP_JOBNAME=set test
//CCP_NODES_CORES=2 NODE02 80 NODE03 80 // This is <NodeCount> <NodeName1> <NodeCores1> <NodeName2> <NodeCores2> ....
//CCP_JOBTYPE=Batch
//CCP_RETRY_COUNT=0
//CCP_HOME=C:\Program Files\Microsoft HPC Pack 2016\
//CCP_JOBTEMPLATE=Default
//CCP_MPI_WORKDIR=%SCRATCH_DIR%\neumann\bin
//CCP_NUMCPUS=160
//CCP_RERUNNABLE=False
//CCP_LOGROOT_USR=%LOCALAPPDATA%\Microsoft\Hpc\LogFiles\
//CCP_TASKINSTANCEID=0
//CCP_PREVIOUS_JOB_ID=14868
//CCP_SCHEDULER=Head
//CCP_STDOUT=C:\Scratch\neumann\logs\out_15317_2.txt
//CCP_NODES=2 NODE02 80 NODE03 80
//CCP_SERVICEREGISTRATION_PATH=CCP_REGISTRATION_STORE;\\HEAD\HpcServiceRegistration
//CCP_TASKSYSTEMID=26014
//CCP_OWNER_SID=S-1-5-21-623046577-3442102450-3314793965-5699
//CCP_WORKDIR=C:\Scratch\neumann\bin
//CCP_STDERR=C:\Scratch\neumann\logs\out_15317_2.txt
//CCP_CONNECTIONSTRING=Head
//CCP_EXCLUSIVE=False
//CCP_LOGROOT_SYS=C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles\
//CCP_ENVLIST=CCP_TASKSYSTEMID,HPC_RUNTIMESHARE,CCP_TASKINSTANCEID,CCP_JOBID,CCP_TASKID
//CCP_RUNTIME=2147483647
//CCP_MPI_NETMASK=141.83.112.0/255.255.255.0
//CCP_COREIDS=0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 // Cores available to the job.
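
As noted next to CCP_NODES_CORES above, the value is laid out as `<NodeCount>` followed by `<NodeName>`/`<NodeCores>` pairs. A minimal parsing sketch of that layout (a hypothetical helper, not HPX code):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse a CCP_NODES / CCP_NODES_CORES style value, e.g.
// "2 NODE02 80 NODE03 80" -> {{"NODE02", 80}, {"NODE03", 80}}.
std::vector<std::pair<std::string, int>> parse_ccp_nodes(std::string const& value)
{
    std::istringstream in(value);
    std::size_t node_count = 0;
    in >> node_count;

    std::vector<std::pair<std::string, int>> nodes;
    nodes.reserve(node_count);
    for (std::size_t i = 0; i != node_count; ++i)
    {
        std::string name;
        int cores = 0;
        if (in >> name >> cores)
            nodes.emplace_back(std::move(name), cores);
    }
    return nodes;
}
```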
hkaiser commented 2 years ago

What would that entail? What does it mean 'supporting Windows HPC Pack'?

Neumann-A commented 2 years ago

What does it mean 'supporting Windows HPC Pack'?

from https://docs.microsoft.com/en-us/previous-versions/orphan-topics/ws.10/ff919691(v=ws.10)?redirectedfrom=MSDN

Microsoft® HPC Pack provides an integrated application platform for running, managing, and developing parallel computing applications. HPC Job Manager provides your primary interface for submitting and monitoring jobs on a cluster.

I just want HPX to deal with the thread binding to the assigned cores via CCP_COREIDS. I don't know if a nodelist is needed via CCP_NODES/CCP_NODES_CORES?

What would that entail?

probably just need to add another environment class into libs/core/batch_environments
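
For illustration, a rough sketch of what detecting the HPC Pack environment could look like; the class name and interface here are hypothetical, and a real addition would mirror the existing environment classes in libs/core/batch_environments:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical shape of an HPC Pack detector; a real implementation would
// follow the existing environment classes in libs/core/batch_environments.
class hpcpack_environment
{
public:
    hpcpack_environment()
    {
        // HPC Pack sets CCP_NODES for scheduled tasks (see the dump above).
        if (char const* nodes = std::getenv("CCP_NODES"))
        {
            nodelist_ = nodes;    // "<count> <name1> <cores1> ..."
            valid_ = true;
        }
    }

    bool valid() const { return valid_; }
    std::string const& node_list() const { return nodelist_; }

private:
    std::string nodelist_;
    bool valid_ = false;
};
```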

hkaiser commented 2 years ago

What would that entail?

probably just need to add another environment class into libs/core/batch_environments

While I can blindly try adding such a binding, I don't have a way to test this. Would you be able to help?

hkaiser commented 2 years ago

FWIW, the environment variables are described here: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-hpc-server-2008r2/gg286970(v=ws.10)

Neumann-A commented 2 years ago

While I can blindly try adding such a binding, I don't have a way to test this. Would you be able to help?

Which example would I have to run for it to be considered tested?

hkaiser commented 2 years ago

hello_world_distributed always works. To test the core bindings, use the --hpx:print-bind command line option.

Neumann-A commented 2 years ago

hello_world_distributed always works.

What is the minimal example/command line to run this on two localities? After fighting IPv4 vs IPv6 (HPX doesn't seem to support raw IPv6 addresses on the command line), I can get it to listen on an open port and get a remote to send to that port, but somehow they don't communicate. heartbeat/heartbeat_console do show communication (inspected via Wireshark), although there is seemingly no console output (?).

So I have https://github.com/Neumann-A/hpx/pull/1/files to support HPC Pack. The PU binding works, but running two localities is still under test until I figure out how the above works.

hkaiser commented 2 years ago

Here is an explanation of how to manually run HPX applications on more than one locality (outside of batch environments): https://stackoverflow.com/a/35381710/269943.

I have not tried using ipv6 addresses on the command line. I'd consider it a bug if those don't work, however.

Neumann-A commented 2 years ago

Here is an explanation of how to manually run HPX applications on more than one locality (outside of batch environments): https://stackoverflow.com/a/35381710/269943.

OK, that is also what I found and tried, in addition to a lot of extra flags. It seems like hello_world_distributed is bugged and doesn't want to start running (or loc 0 doesn't want to communicate with loc 1, which sends the requests). heartbeat_console (loc 0; why is this called console and not server?) and heartbeat (loc 1; prints timestamps) seem to work, however.

heartbeat (--hpx:debug-clp output):

HPC Pack nodelist: 2 ILUINRANDIR 8 ILUIN-DELTA3 4
Localities: 2
Threads: 4
Name: ILUIN-DELTA3
batch_name: HPCPack
num_threads: 4
node_num_: 1
num_localities: 2
got node list
extracted: 'ILUINRANDIR'
incrementing agas_node_num
extracted: 'ILUIN-DELTA3'
incrementing agas_node_num
using AGAS host: 'ILUINRANDIR' (node number 0)
Nodes from nodelist:
ILUIN-DELTA3: 1 (192.168.0.22:0)
ILUINRANDIR: 1 (192.168.0.175:0)
agas host_name: ILUINRANDIR
asio host_name: Iluin-Delta3
host_name: Iluin-Delta3
resolved: 'Iluin-Delta3' to: 192.168.0.22
resolved: 'ILUINRANDIR' to: 192.168.0.175
Configuration before runtime start:
-----------------------------------
hpx.run_hpx_main!=1
hpx.use_process_mask!=0
hpx.nodes!=ILUINRANDIR ILUIN-DELTA3
hpx.expect_connecting_localities=1
hpx.locality!=1
hpx.node!=1
hpx.scheduler=local-priority-fifo
hpx.affinity=pu
hpx.pu_step=1
hpx.pu_offset=0
hpx.numa_sensitive=0
hpx.bind!=balanced
hpx.os_threads=4
hpx.cores=4
hpx.parcel.address=192.168.0.22
hpx.parcel.port=7911   <---- is this port increase intended?
hpx.agas.address=192.168.0.175
hpx.agas.port=7910
hpx.localities!=2
hpx.runtime_mode=worker
-----------------------------------

(prints stuff like: /threadqueue{locality#0/total}/length,144,76.460994[s],0 )

heartbeat_console:

HPC Pack nodelist: 2 ILUINRANDIR 8 ILUIN-DELTA3 4
Localities: 2
Threads: 8
Name: ILUINRANDIR
batch_name: HPCPack
num_threads: 8
node_num_: 0
num_localities: 2
got node list
extracted: 'ILUINRANDIR'
incrementing agas_node_num
extracted: 'ILUIN-DELTA3'
incrementing agas_node_num
using AGAS host: 'ILUINRANDIR' (node number 0)
Nodes from nodelist:
ILUIN-DELTA3: 1 (192.168.0.22:0)
ILUINRANDIR: 1 (192.168.0.175:0)
agas host_name: ILUINRANDIR
asio host_name: Iluinrandir
host_name: Iluinrandir
resolved: 'Iluinrandir' to: 192.168.0.175
resolved: 'ILUINRANDIR' to: 192.168.0.175
Configuration before runtime start:
-----------------------------------
hpx.expect_connecting_localities=1
hpx.use_process_mask!=0
hpx.nodes!=ILUINRANDIR ILUIN-DELTA3
hpx.expect_connecting_localities=1
hpx.locality!=0
hpx.node!=0
hpx.scheduler=local-priority-fifo
hpx.affinity=pu
hpx.pu_step=1
hpx.pu_offset=0
hpx.numa_sensitive=0
hpx.bind!=balanced
hpx.os_threads=8
hpx.cores=8
hpx.parcel.address=192.168.0.175
hpx.parcel.port=7910
hpx.agas.address=192.168.0.175
hpx.agas.port=7910
hpx.agas.service_mode=bootstrap
hpx.localities!=2
hpx.runtime_mode=console
-----------------------------------

(just waits 600s and prints a few .....)

hello_world_distributed (loc 1):

HPC Pack nodelist: 2 ILUINRANDIR 8 ILUIN-DELTA3 4
Localities: 2
Threads: 4
Name: ILUIN-DELTA3
batch_name: HPCPack
num_threads: 4
node_num_: 1
num_localities: 2
got node list
extracted: 'ILUINRANDIR'
incrementing agas_node_num
extracted: 'ILUIN-DELTA3'
incrementing agas_node_num
using AGAS host: 'ILUINRANDIR' (node number 0)
Nodes from nodelist:
ILUIN-DELTA3: 1 (192.168.0.22:0)
ILUINRANDIR: 1 (192.168.0.175:0)
agas host_name: ILUINRANDIR
asio host_name: Iluin-Delta3
host_name: Iluin-Delta3
resolved: 'Iluin-Delta3' to: 192.168.0.22
resolved: 'ILUINRANDIR' to: 192.168.0.175
Configuration before runtime start:
-----------------------------------
hpx.commandline.allow_unknown=1
hpx.commandline.aliasing=0
hpx.use_process_mask!=0
hpx.nodes!=ILUINRANDIR ILUIN-DELTA3
hpx.expect_connecting_localities=1
hpx.locality!=1
hpx.node!=1
hpx.scheduler=local-priority-fifo
hpx.affinity=pu
hpx.pu_step=1
hpx.pu_offset=0
hpx.numa_sensitive=0
hpx.bind!=balanced
hpx.os_threads=4
hpx.cores=4
hpx.parcel.address=192.168.0.22
hpx.parcel.port=7911
hpx.agas.address=192.168.0.175
hpx.agas.port=7910
hpx.localities!=2
hpx.runtime_mode=worker
-----------------------------------

hello_world_distributed (loc 0):

HPC Pack nodelist: 2 ILUINRANDIR 8 ILUIN-DELTA3 4
Localities: 2
Threads: 8
Name: ILUINRANDIR
batch_name: HPCPack
num_threads: 8
node_num_: 0
num_localities: 2
got node list
extracted: 'ILUINRANDIR'
incrementing agas_node_num
extracted: 'ILUIN-DELTA3'
incrementing agas_node_num
using AGAS host: 'ILUINRANDIR' (node number 0)
Nodes from nodelist:
ILUIN-DELTA3: 1 (192.168.0.22:0)
ILUINRANDIR: 1 (192.168.0.175:0)
agas host_name: ILUINRANDIR
asio host_name: Iluinrandir
host_name: Iluinrandir
resolved: 'Iluinrandir' to: 192.168.0.175
resolved: 'ILUINRANDIR' to: 192.168.0.175
Configuration before runtime start:
-----------------------------------
hpx.commandline.allow_unknown=1
hpx.commandline.aliasing=0
hpx.use_process_mask!=0
hpx.nodes!=ILUINRANDIR ILUIN-DELTA3
hpx.expect_connecting_localities=1
hpx.locality!=0
hpx.node!=0
hpx.scheduler=local-priority-fifo
hpx.affinity=pu
hpx.pu_step=1
hpx.pu_offset=0
hpx.numa_sensitive=0
hpx.bind!=balanced
hpx.os_threads=8
hpx.cores=8
hpx.parcel.address=192.168.0.175
hpx.parcel.port=7910
hpx.agas.address=192.168.0.175
hpx.agas.port=7910
hpx.agas.service_mode=bootstrap
hpx.localities!=2
hpx.runtime_mode=console
-----------------------------------

Since I don't see a difference between those two setups, and heartbeat seems to work while hello_world_distributed does not, I have to assume it is bugged somehow.

I have not tried using ipv6 addresses on the command line. I'd consider it a bug if those don't work, however.

Probably due to: https://github.com/STEllAR-GROUP/hpx/blob/0b1a7e3d904b9b3fd228c3bd56430d80f4d23050/libs/core/asio/src/asio_util.cpp#L323 — that find should probably be a find_last_of instead.
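
A raw IPv6 address contains ':' characters itself, so a forward search for the separator splits in the wrong place; a minimal illustration of the difference (a hypothetical helper, not the actual asio_util.cpp code):

```cpp
#include <iostream>
#include <string>
#include <utility>

// Split "host:port". With a raw IPv6 address the host itself contains ':',
// so only searching for the *last* ':' yields the intended split.
std::pair<std::string, std::string> split_host_port(std::string const& s)
{
    std::string::size_type p = s.find_last_of(':');   // find(':') would split
    if (p == std::string::npos)                       // "fe80::1:7910" right
        return {s, std::string()};                    // after "fe80"
    return {s.substr(0, p), s.substr(p + 1)};
}

int main()
{
    auto [host, port] = split_host_port("fe80::1:7910");
    std::cout << host << " | " << port << '\n';       // prints: fe80::1 | 7910
}
```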

Another question: should ctrl+c just print information instead of aborting the application? How do I abort an HPX application then?

hkaiser commented 2 years ago

I'm not aware of hello_world_distributed being broken, especially as it relies on built-in facilities to establish the connection between the different localities. How can I reproduce your issues? The output generated by the two localities looks OK to me. Does it simply hang during startup?

The heartbeat example uses a different startup mode compared to hello_world. It demonstrates how a locality can connect to a running application after the fact. I'm glad that this seems to work for you.

I have not tried using ipv6 addresses on the command line. I'd consider it a bug if those don't work, however.

Probably due to:

https://github.com/STEllAR-GROUP/hpx/blob/0b1a7e3d904b9b3fd228c3bd56430d80f4d23050/libs/core/asio/src/asio_util.cpp#L323

should probably be a find_last_of instead

Yeah, this could very well be the culprit. Thanks.

Another question: should ctrl+c just print information instead of aborting the application? How do I abort an HPX application then?

Ctrl+c on Windows is broken; we're aware of this. It is supposed to stop the execution, but it doesn't.

hkaiser commented 2 years ago

What I might be able to do to debug your problem is to explicitly set the environment variables HPC Pack normally sets. Can you give me those for both localities?

Neumann-A commented 2 years ago

Does it simply hang during startup?

It simply hangs. If I start with --hpx:debug-hpx-log --hpx:debug-agas-log --hpx:debug-parcel-log, the last line shown is given below.

What I might be able to do to debug your problem is to explicitly set the environment variables HPC Pack normally sets. Can you give me those for both localities?

I also only simulated it, by setting CCP_NODES to 2 ILUINRANDIR 8 ILUIN-DELTA3 4, where ILUINRANDIR and ILUIN-DELTA3 are the COMPUTERNAME env variables of the two devices I used. I had problems with these because locally they resolve to an IPv6 address while remotely they are seen as an IPv4 address.

The last line I see in loc 0 using --hpx:debug-clp --hpx:debug-hpx-log --hpx:debug-agas-log --hpx:debug-parcel-log is: (T00000000/----------------.----/----------------) P--------/----------------.---- 00:00.38.093 [0000000000000004][AGAS] <info> primary_namespace::allocate, count(65535), lower({000000015f000001, 0000000000001001}), upper({000000015f000001, 0000000000010fff}), response(success)

Here is the debugger attached: [image] It seems like it is hanging on a lock/condition variable.

Two other threads seem to be hanging at: [image]

Be aware the code was compiled using clang-cl 14.0.6 since I wasn't able to get it to compile with VS 17.3.1 for some reason. I also completely switched to static builds since it threw a failure about a missing entry point at me: [image]

hkaiser commented 2 years ago

I also only simulated it, by setting CCP_NODES to 2 ILUINRANDIR 8 ILUIN-DELTA3 4, where ILUINRANDIR and ILUIN-DELTA3 are the COMPUTERNAME env variables of the two devices I used. I had problems with these because locally they resolve to an IPv6 address while remotely they are seen as an IPv4 address.

When I do this, everything works for me (using your PR).

The stack backtraces indicate that one of the localities is waiting for the other to connect, which makes me think that the hostnames/ports are not correctly set. Could you verify using netstat -a (on both hosts) that they listen at the expected ports?

I can't explain the undefined symbol. Are you trying to use that function from outside of HPX (the symbol is not exported)?

Neumann-A commented 2 years ago

Ctrl+c on Windows is broken; we're aware of this. It is supposed to stop the execution, but it doesn't.

Replace the return TRUE; with a break; in the Windows termination handler (see #5994).
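
For context on that suggestion, the Win32 contract: a handler registered with SetConsoleCtrlHandler that returns TRUE claims the event, while returning FALSE passes it on to the next handler and ultimately to the default one, which terminates the process. A minimal sketch of those semantics (not HPX's actual handler):

```cpp
#include <windows.h>
#include <cstdio>

// Returning FALSE hands the event to the next handler (ultimately the
// default one, which terminates the process); returning TRUE swallows it.
static BOOL WINAPI ctrl_handler(DWORD ctrl_type)
{
    if (ctrl_type == CTRL_C_EVENT)
        std::puts("ctrl+c received, shutting down");
    return FALSE;
}

int main()
{
    SetConsoleCtrlHandler(&ctrl_handler, TRUE);
    for (;;)
        Sleep(1000);    // runs until ctrl+c terminates the process
}
```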

Your SIGABRT handler also seems to be UB according to https://en.cppreference.com/w/cpp/utility/program/signal (no extern "C", and it accesses a variable with static storage duration).
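
For reference, the shape cppreference describes for a conforming handler looks roughly like this (a generic sketch, not a proposed HPX change):

```cpp
#include <csignal>

// Per cppreference, a handler may only touch lock-free atomics or volatile
// std::sig_atomic_t objects, and should have C language linkage.
extern "C" {
    volatile std::sig_atomic_t abort_seen = 0;

    void on_sigabrt(int)
    {
        abort_seen = 1;    // just record the fact; nothing more is allowed here
    }
}

int main()
{
    std::signal(SIGABRT, &on_sigabrt);
    // ... rest of the program; abort_seen can be inspected elsewhere ...
}
```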

I can't explain the undefined symbol. Are you trying to use that function from outside of HPX (the symbol is not exported)?

No, just the examples like hello_world_distributed.

I'll investigate later if I figure out the IPv6 issues again. Somehow it wants to bind to the IPv6 addresses again, which is simply wrong. I don't really know what changed compared to yesterday, though.

Neumann-A commented 2 years ago

Hmm, so the IPv4/IPv6 issue is a bit more intricate. If I just use the node/computer names, it will always use IPv6 if the name is local but IPv4 for external names, so I need some consistent programmatic way to force either IPv4 or IPv6. Is there an option in the TCP parcelport to select one or the other?
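
For reference, at the plain Asio level name resolution can be restricted to IPv4 results, which is the behaviour being asked for; a generic sketch of that mechanism, independent of whatever the HPX TCP parcelport actually exposes:

```cpp
#include <boost/asio.hpp>
#include <iostream>

int main()
{
    namespace asio = boost::asio;
    using asio::ip::tcp;

    asio::io_context io;
    tcp::resolver resolver(io);

    // Passing tcp::v4() limits the results to IPv4 endpoints, even if the
    // host name would locally resolve to an IPv6 address as well.
    for (auto const& entry : resolver.resolve(tcp::v4(), "localhost", "7910"))
        std::cout << entry.endpoint() << '\n';
}
```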

Neumann-A commented 2 years ago

Hmm, after forcing IPv4, hello_world_distributed seems to connect, but I get: (T--------/----------------.----/----------------) P--------/----------------.---- 13:22.59.334 [0000000000000001][ PT] <debug> ({0000000100000000, 0000000000000000}:({0000000100000000, 0000000000000000}:component_invalid[-1]:0000000000000000):register_worker_action) plus another line with binary output in the AGAS debug output of the worker.

Neumann-A commented 2 years ago

Fun stuff... after fighting more with hello_world_distributed and two nodes, the result is: it works if the console node is running the debug build and the worker node is running the relwithdebinfo build. It probably also runs if the worker node runs the debug build, but running the console node with the relwithdebinfo build does not work. So optimizations are somehow screwing up the communication.

Any way to further debug that? I tried setting a breakpoint in big_boot_barrier::notify() but that was never called in the relwithdebinfo build.

hkaiser commented 2 years ago

Could it be that the hostnames are resolved to IPv4 in one place and IPv6 in another place? The only reason why the release build works, and the debug build doesn't could be that some assert is not compiled into the executable, and I know for a fact that we compare resolved IP addresses in asserts (for instance here). In this context, comparing IPv4 with IPv6 will fail.

Neumann-A commented 2 years ago

The only reason why the release build works, and the debug build doesn't could be that some assert is not compiled into the executable

It is the other way around: debug works, release fails. What I can see via netstat and Wireshark is that the worker tries to connect but never gets a response from the release-build console runner (although it is listening according to netstat, and the debug build of the console runner works).

Neumann-A commented 2 years ago

Hmm, what works:

- Running two localities on the same PC.
- Running the debug version on the console locality with two separate PCs (so TCP communication in general works fine; no firewall blocking).

Otherwise it stalls in the big_bootstrap_barrier waiting for connections. The worker is sending, but somehow the receiver doesn't want to ACK, although it is listening (checked with netstat/Wireshark).

From my point of view, the parcelport init hangs. I see that the debug version generates more named threads, while in the RelWithDebInfo build I only get the parcel-thread-tcp threads.

In which thread should io_services.run() / GetQueuedCompletionStatus be executed? I have the feeling those two are somehow scheduled wrongly.
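
For context, the usual Asio pattern is to hand io_context::run() to one or more dedicated I/O threads; on Windows, run() is what ends up blocking in the IOCP wait (GetQueuedCompletionStatus). A generic sketch of that pattern, assuming nothing about how HPX's parcelport actually arranges these threads:

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    namespace asio = boost::asio;

    asio::io_context io;
    // Keep run() from returning before any work has been posted.
    auto guard = asio::make_work_guard(io);

    // Dedicated I/O threads: on Windows each run() call ends up blocking in
    // the IOCP wait (GetQueuedCompletionStatus) until a completion arrives.
    std::vector<std::thread> io_threads;
    for (int i = 0; i != 2; ++i)
        io_threads.emplace_back([&io] { io.run(); });

    // ... post asynchronous work to io here ...

    guard.reset();                  // let run() return once the work is done
    for (auto& t : io_threads)
        t.join();
}
```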

Neumann-A commented 2 years ago

So, OK: after figuring out that the reason for the strange connection behavior was the Windows firewall, I got hello_world_distributed running. Somehow the firewall could distinguish between the debug and release executables, and it blocked the release one because I had started with the debug one. After deleting the firewall app rule for hello_world_distributed, it executed as expected (I had already set up global rules for port 7910, so the app rule was never necessary).

What I now observe is the following:

Neumann-A commented 2 years ago

https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/batch/batch-compute-node-environment-variables.md

It uses pure IPv4 addresses in CCP_NODES instead of hostnames. Maybe Azure is another batch environment to consider supporting?