Open abouteiller opened 6 years ago
Original comment by Thomas Herault (Bitbucket: herault, GitHub: therault).
Tried with hwloc cross-compiled to support BlueGene/Q (see https://www.open-mpi.org/projects/hwloc/doc/v1.11.6/a00305.php#faq_bgq): the problem still persists. Investigating.
W@00000 Oversubscription on core 0 detected
W@00001 Oversubscription on core 0 detected
W@00001 Couldn't bind to cpuset 0x0000000c
W@00002 Oversubscription on core 0 detected
W@00003 Oversubscription on core 0 detected
W@00002 Couldn't bind to cpuset 0x00000030
W@00003 Couldn't bind to cpuset 0x000000c0
W@00001 parsec_hwloc: couldn't bind to cpuset 0x0000000c
W@00000 parsec_hwloc: couldn't bind to cpuset 0x00000003
W@00001 Core binding on node -1 failed
W@00000 Core binding on node -1 failed
BG/Q systems have been or are being decommissioned, and modern systems have no problem using hwloc. Should we close this issue?
Original report by Thomas Herault (Bitbucket: herault, GitHub: therault).
Recent work on the BlueGene/Q system showed that the binding capability is limited.
Binding using hwloc (or other options) on Mira / Cetus fails with the current code.
Reporting of binding is wrong on that system, because hardware threads are not identified by a single identifier.
The proposal is to extend the code by changing the type of the bindto parameter. Several approaches are possible:
Force a dependence on hwloc and use hwloc structures to describe / assign binding. PRO: the work is already done. CON: what if hwloc is complicated to compile on the target architecture?
Have target-architecture-dependent structures to describe the binding. PRO: there are only two cases to manage today (cpuset or BG/Q). CON: this is code bloat.
Define a generic way of describing / assigning binding, and have multiple interfaces (if we are on BG/Q, if we have hwloc, etc.). PRO: might be the cleanest. CON: this in a way duplicates work done in hwloc.
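To make the third approach more concrete, here is a minimal sketch of what a generic binding descriptor with per-platform backends could look like. Everything in it is hypothetical: the names `bind_desc_t`, `bind_backend_t`, `bind_apply` and `bind_format` are not part of PaRSEC or hwloc, the (core, hardware thread) pair used for BG/Q is only an assumption about how such a thread might be identified, and both backends are stubs that validate their input instead of binding anything. A real cpuset backend would presumably call something like `hwloc_set_cpubind` at the marked spot.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: none of these names exist in PaRSEC or hwloc. */

typedef enum { BIND_CPUSET, BIND_BGQ } bind_kind_t;

/* Generic binding description: a tagged union instead of a single integer. */
typedef struct {
    bind_kind_t kind;
    union {
        uint64_t cpuset_mask;                    /* one bit per logical CPU */
        struct { int core; int hwthread; } bgq;  /* assumed (core, hw thread) pair */
    } u;
} bind_desc_t;

/* One backend per way of performing the binding. */
typedef struct {
    const char *name;
    bind_kind_t kind;
    int (*bind)(const bind_desc_t *desc);        /* 0 on success, -1 on failure */
} bind_backend_t;

/* Stub: a real implementation would call e.g. hwloc_set_cpubind() here. */
static int stub_bind_cpuset(const bind_desc_t *desc)
{
    return desc->u.cpuset_mask != 0 ? 0 : -1;
}

/* Stub: a real implementation would use whatever the BG/Q runtime offers. */
static int stub_bind_bgq(const bind_desc_t *desc)
{
    return (desc->u.bgq.core >= 0 && desc->u.bgq.hwthread >= 0) ? 0 : -1;
}

static const bind_backend_t backends[] = {
    { "cpuset", BIND_CPUSET, stub_bind_cpuset },
    { "bgq",    BIND_BGQ,    stub_bind_bgq    },
};

/* Dispatch the descriptor to the first backend that handles its kind. */
int bind_apply(const bind_desc_t *desc)
{
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; i++) {
        if (backends[i].kind == desc->kind)
            return backends[i].bind(desc);
    }
    return -1;  /* no backend available for this kind of binding */
}

/* Report a binding in a form that does not assume a single identifier. */
int bind_format(const bind_desc_t *desc, char *buf, size_t len)
{
    if (desc->kind == BIND_CPUSET)
        return snprintf(buf, len, "cpuset 0x%08" PRIx64, desc->u.cpuset_mask);
    return snprintf(buf, len, "core %d hwthread %d",
                    desc->u.bgq.core, desc->u.bgq.hwthread);
}
```

With such a descriptor, the bindto parameter would carry a `bind_desc_t` rather than an integer, and the reporting code would go through `bind_format` so that a BG/Q thread is not squeezed into a single identifier. The cost noted above remains: the descriptor and its backends partly duplicate what hwloc already provides.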