STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Two threads are slower than one #885

Closed (eschnett closed this issue 11 years ago)

eschnett commented 11 years ago

I have a simple benchmark that runs a home-grown DGEMM on a 2000^2 matrix in a single thread. I build HPX in the "Release" configuration, and add "-Ofast -march=native" on an Ubuntu x86-64 system. I use the MPI parcelport, and set HPX_HAVE_PARCELPORT_TCPIP=0 while running. My machine has 8 cores with hyperthreading enabled.
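For reference, a CMake configuration along these lines produces such a build (the source path is a placeholder, and the parcelport-related CMake options are omitted since they differ between HPX versions):

# Release build with the extra optimization flags mentioned above
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-Ofast -march=native" \
      /path/to/hpx
make -j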

When using a single thread (--hpx:threads=1), the benchmark runs in 8.7 seconds.

When using two threads (--hpx:threads=2), the benchmark runs in 14.0 seconds.

The second thread is idle; I never assign work to it. I see that the HPX process nevertheless uses 200% CPU time. I assume that the idle thread sits in a busy-waiting loop waiting for work, which should be the "right thing" to do in my case.

This issue may be caused by both threads running on the same core (which has two PUs, in hwloc lingo).

  1. When there are more cores than threads, each thread should run on a different core, ignoring hyperthreads (see the sketch after this list).
  2. When there are more threads than cores, busy-waiting is not a good idea.
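As a sketch of item 1 in practice, the --hpx:pu-step option that comes up later in this thread can be used to step over the hyperthread sibling PUs, so that each worker thread ends up on its own physical core:

# two worker threads, one per physical core (every second PU)
./bin/block_matrix --hpx:threads=2 --hpx:pu-step=2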

I have also run this benchmark with 2 localities, using the MPI parcelport, using 1 thread on each locality (mpirun -np 2, --hpx:threads=1, still a single node). Strangely, this uses about 50% CPU time on both of the HPX processes -- why not 100%? This configuration takes 17.5 seconds to finish; even slower!
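Concretely, the two-locality run described here amounts to the following (binary name and environment variable as in the commands further down):

mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -np 2 ./bin/block_matrix --hpx:threads=1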

Having additional idle threads should not reduce performance so much.

eschnett commented 11 years ago

I have used hwloc to output CPU affinities for the OS threads, and this indicates that no affinities are set. Is this so? I thought that HPX required hwloc to be able to set affinities?
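For reference, one way to inspect per-OS-thread affinities on Linux (not necessarily the exact hwloc invocation used here; it assumes the benchmark binary is the block_matrix process from the commands below):

# print the CPU affinity list of every OS thread of the running benchmark
PID=$(pgrep -f block_matrix | head -n 1)
for tid in /proc/"$PID"/task/*; do
    taskset -cp "$(basename "$tid")"
done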

hkaiser commented 11 years ago

What platform are you running on? Could you make the HPX log of the application run available?

eschnett commented 11 years ago

This is an Ubuntu Linux system, an x86-64 workstation with 8 cores on 4 sockets.

I used the following command:

mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 1 ./bin/block_matrix --hpx:threads=1 2>&1 | tee block_matrix.out

Stdout and the HPX logs are at https://www.dropbox.com/sh/1god6lkxf4lc9b2/cgps_TCmxP.

sithhell commented 11 years ago

Just for the record: everything works as expected here. As usual, this is more of a documentation bug; the defaults and their implications should be covered in more detail. The strange behavior with the MPI parcelport is handled in #421.

hkaiser commented 11 years ago

Thanks Eric. Could you add the output of lstopo here as well?

eschnett commented 11 years ago

$ lstopo --of console
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#14)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#16)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#18)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#20)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#22)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#1)
      PU L#13 (P#13)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#3)
      PU L#15 (P#15)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
      PU L#16 (P#5)
      PU L#17 (P#17)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
      PU L#18 (P#7)
      PU L#19 (P#19)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
      PU L#20 (P#9)
      PU L#21 (P#21)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#23)
  HostBridge L#0
    PCIBridge
      PCI 14e4:163a
        Net L#0 "eth0"
      PCI 14e4:163a
        Net L#1 "eth1"
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 10de:06d2
    PCIBridge
      PCI 1000:0072
        Block L#2 "sdc"
        Block L#3 "sdd"
    PCIBridge
      PCI 102b:0532
    PCI 8086:2926

eschnett commented 11 years ago

With

mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 1 ./bin/block_matrix \
    --hpx:threads=2 --hpx:pu-step=2

the run time is 8.6 seconds, as it should be. With --hpx:pu-step=2 the scheduler uses every second PU, so the two worker threads land on two distinct physical cores instead of on the two hyperthreads of a single core.

eschnett commented 11 years ago

Also, with the options

mpirun -x HPX_HAVE_PARCELPORT_TCPIP=0 -x HPX_LOGLEVEL=5 -np 2 ./bin/block_matrix \
    --hpx:1:pu-offset=2 --hpx:threads=1 --hpx:pu-step=2

the benchmark runs at full speed with two localities: --hpx:1:pu-offset=2 shifts the worker thread of the second locality onto the next physical core, so the two localities no longer share a core.