Closed: eschnett closed this issue 10 years ago
We should rather change --hpx:print-bind to report the actual bindings...
If --hpx:print-bind doesn't report the actual bindings, there is a severe bug somewhere between setting the affinity masks and binding them to the cores in use.
Either we have that bug and it needs to be fixed or we can close this ticket. Do we have any evidence for such a problem?
I'll investigate this issue on Wednesday. I thought I had fixed those problems. I will check on various machines with different CPUs. I am currently using hwloc 1.7.2. It might be that earlier hwloc versions have a bug.
I cannot reproduce this problem. Which version of hwloc are you using?
The issue was originally that --hpx:print-bind examines the command line options, creates an HPX-internal representation of these, and then outputs these. It did not call hwloc_get_cpubind to find out the actual bindings. This led to several errors in the past, since what hwloc_set_cpubind actually did was different from what was reported.
I thus request that the code should call hwloc_get_cpubind to find out the actual bindings, and then report these.
The only call to hwloc_get_cpubind is in the "tests" directory. I thus assume that hpx:print-bind does not actually call hwloc_get_cpubind.
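The read-back check requested above can be sketched as follows. This is only an illustration, not HPX code: it uses Python's `os.sched_setaffinity` and `os.sched_getaffinity` (Linux only) as stand-ins for `hwloc_set_cpubind` and `hwloc_get_cpubind`, and the helper names `format_mask`, `bind_and_verify`, and `report_binding` are hypothetical. The point is that the reported mask comes from the OS, not from the mask the program intended to set.

```python
import os


def format_mask(cpus):
    """Render a set of CPU indices as a hex bitmask, e.g. {0, 2} -> '0x5'."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return hex(mask)


def bind_and_verify(requested):
    """Bind the caller to `requested`, then read the binding back from the OS.

    Returns (requested, actual). The two sets differ if the OS did not
    apply the request exactly as given.
    """
    os.sched_setaffinity(0, requested)   # 0 = the caller; what we asked for
    actual = os.sched_getaffinity(0)     # what the OS says is in effect
    return set(requested), set(actual)


def report_binding(requested):
    """Build a one-line report that shows both views and flags a mismatch."""
    requested, actual = bind_and_verify(requested)
    status = "ok" if requested == actual else "MISMATCH"
    return (f"requested={format_mask(requested)} "
            f"actual={format_mask(actual)} [{status}]")


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)    # remember the original binding
    print(report_binding({min(allowed)}))
    os.sched_setaffinity(0, allowed)     # restore it afterwards
```

Printing both the requested and the actual mask makes a silent binding failure visible to the user, which is the reliability property asked for here.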
On a phone call this past Wednesday, Hartmut suggested revisiting this issue once we had a new case where --hpx:print-bind outputs wrong information.
There is now such a case; see #981.
The current code still doesn't use hwloc_get_cpubind to output the actual bindings. As before, only HPX's view of the world is output. Given that view was wrong multiple times in the past weeks, I still strongly suggest to use hwloc_get_cpubind to query the actual CPU bindings, and to output these. Errors in CPU bindings are difficult to detect by an unsuspecting user, and tools such as --hpx:print-bind must be reliable.
watching
On 29.10.2013 at 14:46, "Erik Schnetter" notifications@github.com wrote:
> The current code still doesn't use hwloc_get_cpubind to output the actual bindings. As before, only HPX's view of the world is output.
That's not entirely true. The commit I made now uses the exact same masks that are used to bind the threads. I ditched the code which did the whole command line parsing and led to errors. The only thing that could lead to a wrong output is that hwloc fails to bind the threads correctly. I could add code which queries the current binding again, but I don't see any additional value in that.
> Given that view was wrong multiple times in the past weeks, I still strongly suggest to use hwloc_get_cpubind to query the actual CPU bindings, and to output these. Errors in CPU bindings are difficult to detect by an unsuspecting user, and tools such as --hpx:print-bind must be reliable.
I agree that there were bugs in the way we reported and calculated the thread affinities. This should be detectable now for the reasons described above.
As far as I can tell, we have a reliable solution right now. It is not 100% foolproof against future changes. Will move the final resolution to 1.0.0.
I would like to have an option that makes HPX applications report the actual hwloc bindings used. This should use hwloc to read the bindings from the OS for each thread.
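A per-thread report of this kind might look roughly like the sketch below. It is not HPX or hwloc code: it assumes Linux and Python 3.8+, uses `os.sched_getaffinity` with a native thread id as a stand-in for querying each thread through hwloc, and `report_thread_bindings` is a hypothetical name. A separate reporter reads every worker's binding from the OS while the workers are still alive, so the output does not depend on what each worker believes it set.

```python
import os
import threading


def report_thread_bindings(cpus):
    """Start one worker per entry in `cpus`, pin each worker to its CPU,
    and read every worker's binding back from the OS by native thread id.

    Returns a list of (thread_id, requested_cpu, actual_cpus) tuples.
    """
    ready = threading.Barrier(len(cpus) + 1)   # workers + the reporter
    done = threading.Event()
    tids = [None] * len(cpus)

    def worker(index, cpu):
        tids[index] = threading.get_native_id()
        os.sched_setaffinity(0, {cpu})         # on Linux, 0 = calling thread
        ready.wait(timeout=10)                 # signal: binding requested
        done.wait()                            # stay alive while inspected

    threads = [threading.Thread(target=worker, args=(i, cpu))
               for i, cpu in enumerate(cpus)]
    for t in threads:
        t.start()
    ready.wait(timeout=10)

    # Ask the OS, not the worker, which CPUs each thread may run on.
    report = [(tid, cpu, sorted(os.sched_getaffinity(tid)))
              for tid, cpu in zip(tids, cpus)]

    done.set()
    for t in threads:
        t.join()
    return report


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    for tid, cpu, actual in report_thread_bindings(allowed[:2]):
        print(f"thread {tid}: requested CPU {cpu}, OS reports {actual}")
```

In a real implementation the query would go through hwloc rather than the Linux scheduler interface directly; the structure (enumerate threads, query each, print the OS answer) is what the sketch is meant to show.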