Open omor1 opened 9 months ago
I can reproduce this issue with the following error:
testing_issue624: /home/qcao3/parsec/parsec/remote_dep_mpi.c:1582: remote_dep_mpi_save_put_cb: Assertion `0 != deps->pending_ack' failed.
Changing the flow order in the task class start can bypass this issue.
Describe the bug
PaRSEC reduces the overhead of task activations by sending only a single activation message per destination process; if a process needs multiple output data from a task, it is included in the activation tree of the first of those outputs to be processed. Each process requests only the data that it itself needs, and it fetches that data from its parent in the activation tree. If that parent does not need a superset of the data required by the child, the broadcast goes awry and the parent process segfaults when it accesses the non-existent data.
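To make the tree-membership rule concrete, here is a minimal toy sketch (not PaRSEC code; all names, arrays, and ranks are illustrative assumptions) of how each destination is attached to the tree of the first output that targets it, even when later outputs target it as well:

```c
#include <stdbool.h>
#include <stdio.h>

#define NOUTPUTS 2
#define NRANKS   3

int main(void)
{
    /* needs[o][r] is true when output o must reach rank r
     * (rank 0 plays the role of x, rank 1 of y, rank 2 of z). */
    bool needs[NOUTPUTS][NRANKS] = {
        { false, true,  true  },   /* output A: needed on y and z */
        { false, false, true  },   /* output B: needed on z only  */
    };
    int tree_of[NRANKS];           /* which output's tree activates each rank */

    for (int r = 0; r < NRANKS; r++)
        tree_of[r] = -1;

    /* Outputs are processed in order; each destination joins the tree of
     * the first output that needs it and receives no further activation. */
    for (int o = 0; o < NOUTPUTS; o++)
        for (int r = 0; r < NRANKS; r++)
            if (needs[o][r] && tree_of[r] < 0)
                tree_of[r] = o;

    for (int r = 0; r < NRANKS; r++) {
        if (tree_of[r] < 0)
            printf("rank %d is the source, no activation needed\n", r);
        else
            printf("rank %d is activated through the tree of output %d\n",
                   r, tree_of[r]);
    }
    return 0;
}
```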
The current PaRSEC broadcast looks like this:
At Parent in Broadcast Tree
At Child in Broadcast Tree
What this means
Suppose a task executed on process x has two outputs, A and B. A is needed on both processes y and z, but B is needed only on process z. If both 1) output A is ordered before output B and 2) process y is ordered before process z in the computed topology, then z is attached to the activation tree of output A as a child of y. Process y requests only A, since that is all it needs, but z then requests both A and B from its parent y; y never received B, so it fails when it tries to serve that request.
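As a rough walkthrough of this scenario (the rank assignment x=0, y=1, z=2 and the chain-shaped tree are assumptions for illustration; this is not PaRSEC's actual code or data structures), the following toy program reproduces the mismatch, with the parent y being asked for an output B it never received:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum { A = 0, B = 1, NOUTPUTS = 2 };

int main(void)
{
    /* Which outputs each rank holds after output A's broadcast has started:
     * x produced both, y fetched only A, z has not been served yet. */
    bool holds[3][NOUTPUTS] = {
        [0] = { true,  true  },   /* x: produced A and B        */
        [1] = { true,  false },   /* y: requested only A from x */
        [2] = { false, false },   /* z: about to request from y */
    };

    int parent_of_z = 1;                        /* chain x -> y -> z for output A's tree */
    bool z_requests[NOUTPUTS] = { true, true }; /* z needs both A and B                  */

    for (int o = 0; o < NOUTPUTS; o++) {
        if (!z_requests[o]) continue;
        printf("z requests output %c from rank %d\n", 'A' + o, parent_of_z);
        /* A parent can only serve data it holds; with the current tree
         * construction this check fails for output B. */
        assert(holds[parent_of_z][o] && "parent does not hold requested data");
    }
    return 0;
}
```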
To Reproduce
Compile and run the following with three processes, using the chain broadcast algorithm.
Expected behavior
PaRSEC should not crash for a valid task graph.
Specifically, the broadcast topology should not be constructed in such an invalid manner. There are several solutions to this, with varying tradeoffs.
Another option is to build a global tree for all activations and to send each output's data along a separate per-output tree. This requires some additional support for handling out-of-order requests for data, e.g. when a child requests data from a parent that either doesn't yet have the data or has perhaps not even received the activation. @bosilca has concerns that such a scheme would put additional stress on the network and cause processes to buffer too much data for certain algorithms, but it also has the potential to unlock additional overlap opportunities for certain latency-sensitive algorithms. I think there are ways to manage the potential drawbacks and that, overall, this would provide the most robust solution, but it is also somewhat more difficult to implement.
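For illustration only, here is a rough sketch, under assumptions, of the per-output data path this option implies: each output's data would follow a chain built exclusively from the ranks that need that output, while the activation still travels along one global tree. The function and variable names are hypothetical, not part of PaRSEC.

```c
#include <stdio.h>

/* Build a chain (root -> first consumer -> second consumer -> ...) restricted
 * to the ranks that need this particular output; parent[r] == -1 means rank r
 * either is the root or does not need the output. */
static void build_output_chain(int root, const int *needs, int nranks, int *parent)
{
    int prev = root;
    for (int r = 0; r < nranks; r++) {
        parent[r] = -1;
        if (r == root || !needs[r]) continue;
        parent[r] = prev;   /* fetch this output from the previous consumer */
        prev = r;
    }
}

int main(void)
{
    int needs_A[3] = { 0, 1, 1 };   /* A is needed on y(1) and z(2) */
    int needs_B[3] = { 0, 0, 1 };   /* B is needed on z(2) only     */
    int parent_A[3], parent_B[3];

    build_output_chain(0, needs_A, 3, parent_A);
    build_output_chain(0, needs_B, 3, parent_B);

    for (int r = 1; r < 3; r++)
        printf("rank %d: fetches A from %d, fetches B from %d\n",
               r, parent_A[r], parent_B[r]);
    return 0;
}
```

In the scenario above, A would still flow x -> y -> z, but z would fetch B directly from x, the only rank guaranteed to hold it, instead of from y.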
Additional context
I believe that this problem is related to #252, but the communication infrastructure has changed significantly since then; the solution mentioned in that issue is similar to the fourth option proposed above. I have discussed this issue with @bosilca extensively.