
Broadcast incorrect under certain circumstances, causing segfaults #624


omor1 commented 9 months ago

Describe the bug

PaRSEC reduces the overhead of task activations by sending only a single activation message per destination process; if a process needs multiple output data from a task, it is included in the activation tree of the first output processed. Each process requests only the data it itself needs, and fetches that data from its parent in the activation tree. If the parent does not need a superset of the data required by the child, the broadcast goes awry: the parent segfaults when the child's request makes it access data it never had.

The current PaRSEC broadcast looks like this:

At Parent in Broadcast Tree
  1. Build destination list, for each output
  2. For each output, for each destination
    1. If destination not yet in forwarded list
      1. Add destination to tree
      2. If destination is child of current node, send activation to destination, including pointer to deps structure
      3. Mark destination as forwarded
At Child in Broadcast Tree
  1. Build list of outputs needed locally
  2. Release all needed outputs with no data
  3. For each remaining needed output
    1. Allocate memory for data
    2. Send request for data to parent, including pointer to local data and pointer to parent’s deps structure
At Parent in Broadcast Tree
  1. Retrieve deps and memory for data
  2. Start put to child
At Child in Broadcast Tree
  1. Release output
  2. If all outputs retrieved
    1. Propagate activation
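
For concreteness, here is a minimal C sketch of the parent-side forwarding step above. All names (output_t, add_to_tree, send_activation, and so on) are hypothetical stand-ins, not PaRSEC's actual internals; the point is only to show how a destination gets attached to the tree of the first output that lists it, even when it needs later outputs too:

#include <stdio.h>

#define NRANKS 3

typedef struct {
    int ndest;
    int dest[NRANKS];   /* ranks that need this output */
} output_t;

/* Stubs standing in for the real tree construction and communication. */
static void add_to_tree(int output, int rank)
{ printf("rank %d joins the tree of output %d\n", rank, output); }
static int  is_my_child(int rank) { (void)rank; return 1; }
static void send_activation(int rank)
{ printf("activation sent to rank %d\n", rank); }

/* A destination joins the tree of the FIRST output that lists it,
 * regardless of what else it needs -- the root cause of the bug. */
static void forward_activations(output_t *outputs, int noutputs)
{
    int forwarded[NRANKS] = { 0 };
    for (int o = 0; o < noutputs; o++)
        for (int d = 0; d < outputs[o].ndest; d++) {
            int rank = outputs[o].dest[d];
            if (forwarded[rank]) continue;   /* deduplicated */
            add_to_tree(o, rank);
            if (is_my_child(rank)) send_activation(rank);
            forwarded[rank] = 1;
        }
}

int main(void)
{
    /* Scenario from the next section: output A (index 0) is needed by
     * ranks 1 and 2, output B (index 1) only by rank 2. Rank 2 ends up
     * in A's tree and is never attached to B's. */
    output_t outputs[2] = { { 2, { 1, 2 } }, { 1, { 2 } } };
    forward_activations(outputs, 2);
    return 0;
}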

What this means

Suppose a task executed on process x has two outputs, A and B. A is needed on both processes y and z, but B is needed only on process z. If both 1) output A is ordered before output B and 2) process y is ordered before process z in the computed topology, then:

  1. x will send an activation to y
  2. y will request A from x
  3. Once y has A, it will propagate activation to z
  4. z will request both A and B from y
  5. y does not have B and will segfault

To Reproduce

Compile and run the following with three processes, using the chain broadcast algorithm.

extern "C" %{
#include <parsec.h>
#include <parsec/data_dist/matrix/two_dim_rectangle_cyclic.h>
#include <parsec/data_dist/matrix/matrix.h>
#include <parsec/vpmap.h>

#include <stdlib.h>
#include <stdio.h>

#if defined(PARSEC_HAVE_MPI)
#include <mpi.h>
#endif

%}

descA       [ type = "parsec_tiled_matrix_dc_t*" ]

start(i)
i = 0 .. 0
: descA(0, i)          /* runs on rank 0 */
RW A <- descA(0, i)
     -> A task_x(1)    /* A is needed on ranks 1 and 2 */
     -> A task_y(2)
RW B <- descA(0, i)
     -> B task_y(2)    /* B is needed on rank 2 only */
; 2
BODY
    printf("%d: have A and B\n", es->virtual_process->parsec_context->my_rank);
    fflush(stdout);
END

task_x(i)
i = 1 .. 1
: descA(0, i)          /* runs on rank 1 */
READ A <- A start(0)
; 1
BODY
    printf("%d: have A\n", es->virtual_process->parsec_context->my_rank);
    fflush(stdout);
END

task_y(i)
i = 2 .. 2
: descA(0, i)          /* runs on rank 2 */
READ A <- A start(0)
READ B <- B start(0)
; 0
BODY
    printf("%d: have A and B\n", es->virtual_process->parsec_context->my_rank);
    fflush(stdout);
END

extern "C" %{
void parsec_bcast_bug_Destruct(parsec_taskpool_t *taskpool)
{
  parsec_bcast_bug_taskpool_t *bcast_bug_taskpool = (parsec_bcast_bug_taskpool_t *)taskpool;
  parsec_matrix_del2arena(&bcast_bug_taskpool->arenas_datatypes[PARSEC_bcast_bug_DEFAULT_ARENA]);
  parsec_taskpool_free(taskpool);
}

int main(int argc, char *argv[])
{
    parsec_context_t* parsec;
    parsec_taskpool_t* bcast_bug_taskpool;
    parsec_bcast_bug_taskpool_t* taskpool = NULL;
    int rank, nodes;
    int size = 128;

    /* Default */
    int cores = 1;

#if defined(PARSEC_HAVE_MPI)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#else
    nodes = 1;
    rank = 0;
#endif

    if( nodes != 3 ) {
        /* need exactly 3 nodes */
        printf("Need 3 nodes, have %d\n", nodes);
        exit(-1);
    }

    /* Initialize PaRSEC */
    parsec = parsec_init(cores, &argc, &argv);

    if( NULL == parsec ) {
        /* Failed to correctly initialize. In a correct scenario report
         * upstream, but in this particular case bail out.
         */
        exit(-1);
    }

    printf("%d: starting\n", rank);

    /* initializing matrix structure */
    two_dim_block_cyclic_t dcA;
    two_dim_block_cyclic_init(&dcA, matrix_RealDouble, matrix_Tile,
                              nodes, rank, 1, size, 1, size*nodes, 0, 0,
                              1, size*nodes, 1, 1, 1);
    dcA.mat = parsec_data_allocate((size_t)dcA.super.nb_local_tiles *
                                   (size_t)dcA.super.bsiz *
                                   (size_t)parsec_datadist_getsizeoftype(dcA.super.mtype));
    parsec_data_collection_set_key((parsec_data_collection_t*)&dcA, "dcA");

    taskpool = parsec_bcast_bug_new((parsec_tiled_matrix_dc_t *)&dcA);
    bcast_bug_taskpool = (parsec_taskpool_t*)taskpool;

    parsec_matrix_add2arena( &taskpool->arenas_datatypes[PARSEC_bcast_bug_DEFAULT_ARENA],
                             parsec_datatype_double_t, matrix_UpperLower,
                             1, 1, size, 1,
                             PARSEC_ARENA_ALIGNMENT_SSE, -1 );

    parsec_context_add_taskpool(parsec, bcast_bug_taskpool);
    parsec_context_start(parsec);
    parsec_context_wait(parsec);

    parsec_bcast_bug_Destruct(bcast_bug_taskpool);
    parsec_tiled_matrix_dc_destroy((parsec_tiled_matrix_dc_t*)&dcA);

    /* Clean up parsec */
    parsec_fini(&parsec);
#if defined(PARSEC_HAVE_MPI)
    MPI_Finalize();
#endif
    return 0;
}
%}

Expected behavior

PaRSEC should not crash for a valid task graph.

Specifically, the broadcast topology should not be constructed in such an invalid manner. There are several solutions to this, with varying tradeoffs.

  1. Disable all activation merging/deduplication; send a separate activation to each destination for each output. This requires some work to allow a process to receive multiple activations for the same task, which currently causes problems. It will also increase overhead, perhaps significantly for task graphs with several outputs that always travel together; I believe this is a common pattern for many PaRSEC applications.
  2. Allow processes to fetch data needed by descendants in the activation tree, not just themselves. This may significantly increase bandwidth utilization and required temporary storage for task graphs that currently exhibit the crashing behavior documented above.
  3. Partition the set of destination processes into subsets that require the exact same set of data, and use these partitions to build independent activation trees. This solution doesn't increase the number of messages or the amount of data sent over the network, but has overhead in constructing these partitions: naïvely, O(2^n) for computing all possible intersections for n outputs; or O(k), where k is the sum of the destination-list sizes over all outputs, by using a partition refinement algorithm (see the sketch after this list).
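
To make the partition-refinement variant of option 3 concrete, here is a minimal, self-contained C sketch using bitmasks over ranks as a stand-in for real destination lists; it is illustrative only, and none of it is PaRSEC code:

#include <stdio.h>

#define MAX_BLOCKS 64

/* Split every block of the current partition against one output's
 * destination set. Blocks entirely inside or outside the set survive
 * unchanged; straddling blocks are split in two. Returns the new count. */
static int refine(unsigned blocks[], int nblocks, unsigned dest_mask)
{
    int n = nblocks;
    for (int b = 0; b < nblocks; b++) {
        unsigned in  = blocks[b] & dest_mask;
        unsigned out = blocks[b] & ~dest_mask;
        if (in && out) {
            blocks[b]   = in;
            blocks[n++] = out;   /* sketch: no overflow check */
        }
    }
    return n;
}

int main(void)
{
    /* Ranks as bits: rank r is (1u << r). Scenario from above:
     * dest(A) = {1, 2}, dest(B) = {2}. */
    unsigned blocks[MAX_BLOCKS] = { (1u << 1) | (1u << 2) };
    int nblocks = 1;
    nblocks = refine(blocks, nblocks, (1u << 1) | (1u << 2)); /* by A */
    nblocks = refine(blocks, nblocks, (1u << 2));             /* by B */
    for (int b = 0; b < nblocks; b++)   /* prints the groups {2}, {1} */
        printf("group %d: rank mask 0x%x\n", b, blocks[b]);
    return 0;
}

Each resulting group needs exactly the same set of outputs and can therefore be given its own activation tree; with proper destination lists instead of machine words, a partition-refinement structure achieves the O(k) bound mentioned above.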

Another option is to build a global tree for all activations and send each output's data along a separate per-output tree. This requires some additional support for handling out-of-order requests for data, where e.g. a child requests data from a parent that either doesn't yet have the data or perhaps hasn't even received the activation yet; a sketch of such deferred handling follows below. @bosilca has concerns about such a scheme putting additional stress on the network and causing processes to buffer too much data for certain algorithms, but it also has the potential to unlock additional overlap opportunities for certain latency-sensitive algorithms. I think there are ways to manage the potential drawbacks and that, overall, this would provide the most robust solution, but it is also somewhat more difficult to implement.
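
To illustrate the out-of-order handling this scheme would need, here is a hypothetical sketch (again, not PaRSEC's API) of a parent in a per-output data tree that parks a child's request until its own copy of the data arrives:

#include <stdio.h>
#include <stdlib.h>

typedef struct pending_req {
    int child_rank;
    struct pending_req *next;
} pending_req_t;

typedef struct {
    void *data;               /* NULL until received from our parent */
    pending_req_t *pending;   /* children that asked too early */
} output_state_t;

static void start_put(int child, void *data)
{ printf("put %p to rank %d\n", data, child); }

/* A child asked for this output's data. */
static void on_request(output_state_t *o, int child)
{
    if (o->data != NULL) {            /* in order: serve immediately */
        start_put(child, o->data);
    } else {                          /* out of order: park the request */
        pending_req_t *r = malloc(sizeof *r);
        r->child_rank = child;
        r->next = o->pending;
        o->pending = r;
    }
}

/* Our own copy just arrived: serve everything that was parked. */
static void on_data_arrival(output_state_t *o, void *data)
{
    o->data = data;
    while (o->pending != NULL) {
        pending_req_t *r = o->pending;
        o->pending = r->next;
        start_put(r->child_rank, o->data);
        free(r);
    }
}

int main(void)
{
    output_state_t o = { NULL, NULL };
    double payload = 42.0;
    on_request(&o, 2);                /* the request beats the data */
    on_data_arrival(&o, &payload);    /* the parked request is serviced */
    return 0;
}

The buffering concern above maps directly onto the pending list and the staged data it pins; one way to manage it is to bound how many parked requests and temporaries a process tolerates before applying back-pressure.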

Additional context

I believe that this problem is related to #252, but the communication infrastructure has changed significantly since then; the solution mentioned in that issue is similar to the fourth option proposed above. I have discussed this issue with @bosilca extensively.

QingleiCao commented 9 months ago

I can reproduce this issue with the following error:

testing_issue624: /home/qcao3/parsec/parsec/remote_dep_mpi.c:1582: remote_dep_mpi_save_put_cb: Assertion `0 != deps->pending_ack' failed.

Changing the flow order in the task class start also bypasses the issue, presumably because task_y(2) is then attached to the broadcast tree of a flow whose parent in the tree holds all the data it requests.