STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Would like option to report hwloc bindings #973

Closed eschnett closed 10 years ago

eschnett commented 11 years ago

I would like to have an option that makes HPX applications report the actual hwloc bindings used. This should use hwloc to read the bindings from the OS for each thread.

hkaiser commented 11 years ago

We should rather change --hpx:print-bind to report the actual bindings...

sithhell commented 11 years ago

If --hpx:print-bind doesn't report the actual bindings, there is a severe bug somewhere between setting the affinity masks and binding them to the cores in use.

hkaiser commented 11 years ago

Either we have that bug and it needs to be fixed or we can close this ticket. Do we have any evidence for such a problem?

sithhell commented 11 years ago

I'll investigate this issue on Wednesday. I thought I had fixed those problems. I will check on various machines with different CPUs. I am currently using hwloc 1.7.2. It might be that earlier hwloc versions have a bug.

sithhell commented 11 years ago

I cannot reproduce this problem. Which version of hwloc are you using?

eschnett commented 11 years ago

The issue was originally that --hpx:print-bind examines the command line options, builds an HPX-internal representation of them, and then prints that representation. It did not call hwloc_get_cpubind to find out the actual bindings. This led to several errors in the past, since what hwloc_set_cpubind actually did was different from what was reported.

I thus request that the code should call hwloc_get_cpubind to find out the actual bindings, and then report these.

The only call to hwloc_get_cpubind is in the "tests" directory. I thus assume that hpx:print-bind does not actually call hwloc_get_cpubind.
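The difference between reporting the requested binding and reporting the binding the OS actually applied can be sketched as follows. This is only an illustration, not HPX code: to keep it free of external dependencies it uses the Linux calls sched_setaffinity and sched_getaffinity as stand-ins for hwloc_set_cpubind and hwloc_get_cpubind, and the names cpus_in_mask, format_cpus, query_actual_binding, binding_report, and bind_and_verify are hypothetical helpers invented for this sketch.

```cpp
// Linux-only sketch. sched_setaffinity/sched_getaffinity stand in for
// hwloc_set_cpubind/hwloc_get_cpubind so that no extra library is needed.
#include <sched.h>

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: expand an affinity mask into an ascending list of CPU indices.
std::vector<int> cpus_in_mask(cpu_set_t mask)
{
    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
    {
        if (CPU_ISSET(cpu, &mask))
            cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: render a CPU list as "0,2,3" for a print-bind style report.
std::string format_cpus(const std::vector<int>& cpus)
{
    std::string out;
    for (int cpu : cpus)
    {
        if (!out.empty())
            out += ',';
        out += std::to_string(cpu);
    }
    return out;
}

// Ask the OS which CPUs the calling thread may run on (pid 0 means the caller).
bool query_actual_binding(std::vector<int>& actual)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return false;
    actual = cpus_in_mask(mask);
    return true;
}

struct binding_report
{
    std::vector<int> requested;  // what the runtime believes it asked for
    std::vector<int> actual;     // what the OS reports after binding
    bool bind_ok = false;        // the set call reported success
    bool matches = false;        // requested and actual are identical
};

// Bind the calling thread to `requested`, then read the binding back from the
// OS instead of trusting the requested mask.
binding_report bind_and_verify(std::vector<int> requested)
{
    // Normalize so the list can be compared with the ascending list from the OS.
    std::sort(requested.begin(), requested.end());
    requested.erase(std::unique(requested.begin(), requested.end()), requested.end());

    binding_report report;
    report.requested = requested;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu : report.requested)
        CPU_SET(cpu, &mask);

    report.bind_ok = sched_setaffinity(0, sizeof(mask), &mask) == 0;
    const bool queried = query_actual_binding(report.actual);
    report.matches = report.bind_ok && queried && report.actual == report.requested;
    return report;
}
```

A report option built this way would print format_cpus(report.actual) rather than format_cpus(report.requested), and could flag any thread for which matches is false. With hwloc, the read-back step would correspond to calling hwloc_get_cpubind with the HWLOC_CPUBIND_THREAD flag on each worker thread.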

eschnett commented 11 years ago

On a phone call this past Wednesday, Hartmut suggested revisiting this issue once we had a new case where --hpx:print-bind outputs wrong information.

There is now such a case; see #981.

eschnett commented 11 years ago

The current code still doesn't use hwloc_get_cpubind to output the actual bindings. As before, only HPX's view of the world is output. Given that this view was wrong multiple times in the past weeks, I still strongly suggest using hwloc_get_cpubind to query the actual CPU bindings, and outputting these. Errors in CPU bindings are difficult for an unsuspecting user to detect, and tools such as --hpx:print-bind must be reliable.

pagrubel commented 11 years ago

watching

sithhell commented 11 years ago

On 29.10.2013 at 14:46, "Erik Schnetter" notifications@github.com wrote:

> The current code still doesn't use hwloc_get_cpubind to output the actual bindings. As before, only HPX's view of the world is output.

That's not entirely true. The commit I made now uses the exact same masks that are used to bind the threads. I ditched the code that did the whole command line parsing and led to errors. The only thing that could lead to wrong output is hwloc failing to bind the threads correctly. I could add code that queries the current binding again, but I don't see any additional value in that.

> Given that this view was wrong multiple times in the past weeks, I still strongly suggest using hwloc_get_cpubind to query the actual CPU bindings, and outputting these. Errors in CPU bindings are difficult for an unsuspecting user to detect, and tools such as --hpx:print-bind must be reliable.

I agree that there were bugs in the way we reported and calculated the thread affinities. This should be detectable now for the reasons described above.

sithhell commented 11 years ago

As far as I can tell, we have a reliable solution right now. It is not 100% foolproof against future changes. I will move the final resolution to 1.0.0.