Closed: eschnett closed this issue 11 years ago
I have used hwloc to output the CPU affinities of the OS threads, and it indicates that no affinities are set. Is that really the case? I thought HPX required hwloc precisely so that it could set affinities?
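For reference, the check I have in mind looks roughly like this (assuming the hwloc command-line utilities are installed and the benchmark process is called block_matrix; option spellings may differ between hwloc versions):

$ hwloc-ps -a -t                                  # list all processes and their threads together with any CPU binding
$ hwloc-bind --get --pid $(pgrep block_matrix)    # query the binding of the running benchmark process

This is how I concluded that no affinities are being set.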
What platform do you run on? Could you make available the HPX log of running the application?
This is an Ubuntu Linux system, an x86-64 workstation with 8 cores on 4 sockets.
I used the following command:
mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 1 ./bin/block_matrix --hpx:threads=1 2>&1 | tee block_matrix.out
Stdout and the HPX logs are at https://www.dropbox.com/sh/1god6lkxf4lc9b2/cgps_TCmxP.
Just for the record: everything works as expected here. As usual, this is more of a documentation bug; the defaults and their possible implications should be covered in more detail. The strange behavior with the MPI parcelport is handled in #421.
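For example (exact option spellings depend on the HPX version in use), the effective binding can be inspected and overridden directly on the command line:

$ mpirun -np 1 ./bin/block_matrix --hpx:threads=2 --hpx:print-bind     # print which PU each worker thread is bound to
$ mpirun -np 1 ./bin/block_matrix --hpx:threads=2 --hpx:bind=scatter   # spread worker threads across cores instead of packing them onto neighbouring PUs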
Thanks Eric. Could you add the output of lstopo here as well?
$ lstopo --of console
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#14)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#16)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#18)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#20)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#22)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#1)
      PU L#13 (P#13)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#3)
      PU L#15 (P#15)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
      PU L#16 (P#5)
      PU L#17 (P#17)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
      PU L#18 (P#7)
      PU L#19 (P#19)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
      PU L#20 (P#9)
      PU L#21 (P#21)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#23)
  HostBridge L#0
    PCIBridge
      PCI 14e4:163a
        Net L#0 "eth0"
      PCI 14e4:163a
        Net L#1 "eth1"
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 10de:06d2
        PCIBridge
          PCI 1000:0072
            Block L#2 "sdc"
            Block L#3 "sdd"
        PCIBridge
          PCI 102b:0532
    PCI 8086:2926
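Note that on this machine PU L#0 (P#0) and PU L#1 (P#12) are the two hardware threads of Core L#0, so two worker threads placed on consecutive logical PUs end up sharing one physical core. If hwloc-calc is available, this can be confirmed with:

$ hwloc-calc --intersect pu core:0    # reports the logical PU indexes contained in core 0, here 0,1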
With
mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 1 ./bin/block_matrix \
--hpx:threads=2 --hpx:pu-step=2
the run time is 8.6 seconds, as it should be.
Also, with the options
mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 2 ./bin/block_matrix \
--hpx:1:pu-offset=2 --hpx:threads=1 --hpx:pu-step=2
the benchmark runs at full speed with two localities.
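As far as I understand these options (my reading, not verified against the documentation), --hpx:1:pu-offset=2 applies only to locality 1, so locality 0 places its single worker thread on PU L#0 (Core L#0) and locality 1 places its thread on PU L#2 (Core L#1), i.e. the two localities land on different physical cores. While the benchmark is running this can be double-checked with:

$ hwloc-ps | grep block_matrix    # hwloc-ps lists bound processes together with their cpusets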
I have a simple benchmark that runs a home-grown DGEMM on a 2000^2 matrix in a single thread. I build HPX in the "Release" configuration with "-Ofast -march=native" added, on an Ubuntu x86-64 system. I use the MPI parcelport and set HPX_HAVE_PARCELPORT_TCPIP=0 when running. My machine has 8 cores with hyperthreading enabled.
When using a single thread (--hpx:threads=1), the benchmark runs in 8.7 seconds.
When using two threads (--hpx:threads=2), the benchmark runs in 14.0 seconds.
The second thread is idle; I never assign work to it. I see that the HPX process nevertheless uses 200% CPU time. I assume that the idle thread sits in a busy-waiting loop waiting for work, which should be the "right thing" to do in my case.
This issue may be caused by both threads running on the same core (which has two PUs, in hwloc lingo).
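One way to check this hypothesis with standard Linux tools (assuming the process is called block_matrix) is to look at which logical CPU each OS thread last ran on and then map those P# numbers back to physical cores with lstopo:

$ ps -L -o tid,psr,pcpu -p $(pgrep block_matrix)    # psr is the logical CPU (P#) the thread last ran on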
I have also run this benchmark with 2 localities over the MPI parcelport, with 1 thread on each locality (mpirun -np 2, --hpx:threads=1, still a single node). Strangely, each of the two HPX processes uses only about 50% CPU time -- why not 100%? This configuration takes 17.5 seconds to finish, which is even slower!
Having additional idle threads should not reduce performance so much.