emusolutions / LAGraph

This is a library plus a test harness for collecting algorithms that use the GraphBLAS

`ctest_MMRead` LC Failure #12

Closed: jamesETsmith closed this issue 1 year ago

jamesETsmith commented 1 year ago

General

This is a followup to #4. This test fails without crashing; here's a snippet of the error (there may be other errors earlier in the test):

15: Test MMWrite...
15: setup: /net/hyper120h-d/data/jsmith/apps/LAGraph/src/test/test_MMRead.c
15: data is in [/net/hyper120h-d/data/jsmith/apps/LAGraph/data/]
15: Selected mode is not availble on this architecture. Setting mode to GrB_BLOCKING.
15: =============  0: west0067.mtx
15: [ FAILED ]
15:   Case west0067.mtx:
15:     test_MMRead.c:497: Check LAGraph_MMRead (&B, f, msg) == 0... failed
15:     test_MMRead.c:499: Check LAGraph_Matrix_TypeName (btype_name, B, msg) == 0... failed
15:     test_MMRead.c:510: Check ok... failed
15:       Test for A and B equal failed: /net/hyper120h-d/data/jsmith/apps/LAGraph/data/comments_west0067.mtx
15:     test_MMRead.c:532: Check LAGraph_MMRead (&A, f, msg) == 0... failed
15:     test_MMRead.c:535: Check _Generic( (&a), _Bool * : GrB_Matrix_extractElement_BOOL, int8_t * : GrB_Matrix_extractElement_INT8, uint8_t * : GrB_Matrix_extractElement_UINT8, int16_t * : GrB_Matrix_extractElement_INT16, uint16_t * : GrB_Matrix_extractElement_UINT16, int32_t * : GrB_Matrix_extractElement_INT32, uint32_t * : GrB_Matrix_extractElement_UINT32, int64_t * : GrB_Matrix_extractElement_INT64, uint64_t * : GrB_Matrix_extractElement_UINT64, float * : GrB_Matrix_extractElement_FP32, double * : GrB_Matrix_extractElement_FP64, void * : GrB_Matrix_extractElement_UDT)(&a, A, 0, 0) == 0... failed
15:     test_MMRead.c:536: Check isnan (a)... failed
15:

moonwatcher commented 1 year ago

The issues with MMRead all seem to trace back to the reducers in matrix_extractTuples. It's easiest to see with sources_7.mtx, which is a dense 64x1 column vector and so is treated like a matrix by the method.

emu::striped_array<GrB_Index> tstart(nrows), sizes(nrows);

emu::parallel::for_each(emu::static_policy<1, 256>(), M.begin(), M.end(), [&](GrB_Index i) {
    emu::repl<GrB_Index> ts_ = 0;
    {
        emu::reducer_opadd<GrB_Index> nsum(&ts_);

        emu::parallel::for_each(emu::static_policy<1, 4>(), M.begin(), M.begin() + i, [&](GrB_Index j) {
            nsum += sizes[j];
        });
    } // end of scope
    tstart[i] = emu::repl_reduce(ts_, std::plus<>());
});

printf("matrix_extractTuples: Row ,tstart, sizes\n");
for (size_t k=0; k<nrows; ++k) {
    printf("%" PRId64 " %" PRId64 " %" PRId64 "\n", k, sizes[k], tstart[k]);
}

This is supposed to compute, for each row, the number of entries (sizes) and the offset of that row's entries in the IJX arrays (tstart). So for a dense column vector, sizes should be all 1s and tstart should count up 0, 1, 2, 3, ... but for some reason it's not...

matrix_extractTuples: Row, sizes, tstart
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 1 13
14 1 14
15 1 15
16 1 14
17 1 14
18 1 14
19 1 15
20 1 14
21 1 14
22 1 13
23 1 15
24 1 15
25 1 15
26 1 15
27 1 15
28 1 17
29 1 15
30 1 22
31 1 16
32 1 17
33 1 16
34 1 16
35 1 29
36 1 13
37 1 18
38 1 17
39 1 13
40 1 18
41 1 19
42 1 19
43 1 18
44 1 19
45 1 20
46 1 19
47 1 18
48 1 21
49 1 20
50 1 19
51 1 20
52 1 19
53 1 19
54 1 18
55 1 19
56 1 18
57 1 19
58 1 32
59 1 18
60 1 21
61 1 19
62 1 21
63 1 21
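
For reference, the intended result is just an exclusive prefix sum of sizes; a minimal serial sketch in plain C++ (the helper name is hypothetical and independent of emu_cxx_utils):

// Serial reference: tstart[i] = sizes[0] + ... + sizes[i-1] (an exclusive
// prefix sum), so for an all-ones sizes array tstart should be 0, 1, 2, 3, ...
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::int64_t> exclusive_prefix_sum(const std::vector<std::int64_t> &sizes) {
    std::vector<std::int64_t> tstart(sizes.size());
    std::int64_t running = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        tstart[i] = running;   // offset of row i into the IJX arrays
        running += sizes[i];
    }
    return tstart;
}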

Replacing the inner loop with serial execution:

emu::parallel::for_each(emu::static_policy<1, 256>(), M.begin(), M.end(), [&](GrB_Index i) {
    tstart[i] = 0;
    std::for_each(M.begin(), M.begin() + i, [&](GrB_Index j) {
        tstart[i] += sizes[j];
    });
});

Or even just setting the policy on the inner loop to spawn just one thread (which I assume is the same thing):

emu::parallel::for_each(emu::static_policy<1, 256>(), M.begin(), M.end(), [&](GrB_Index i) {
    emu::repl<GrB_Index> ts_ = 0;
    {
        emu::reducer_opadd<GrB_Index> nsum(&ts_);

        emu::parallel::for_each(emu::static_policy<1, 1>(), M.begin(), M.begin() + i, [&](GrB_Index j) {
            nsum += sizes[j];
        });
    } // end of scope
    tstart[i] = emu::repl_reduce(ts_, std::plus<>());
});

Yields the correct result and the test passes.

Still trying to wrap my head around the syntax for the reducers to understand how they work. Either they are being used incorrectly or there is an underlying bug, but it seems like something isn't parallelizing correctly.

moonwatcher commented 1 year ago

Ok, I was comparing this to other places with the same pattern, and it's the pointer arithmetic that is the culprit, because this works:

emu::parallel::for_each(emu::static_policy<1, 256>(), M.begin(), M.end(), [&](GrB_Index i) {
    emu::repl<GrB_Index> ts_ = 0;
    {
        emu::reducer_opadd<GrB_Index> nsum(&ts_);

        emu::parallel::for_each(emu::static_policy<1, 4>(), M.begin(), M.end(), [&](GrB_Index j) {
            if(j < i) {
                nsum += sizes[j];
            }
        });
    } // end of scope
    tstart[i] = emu::repl_reduce(ts_, std::plus<>());
});

The M.begin() + i is somehow not working. I suspect it's because M.begin() and M.end() iterate over an emu::striped_array<GrB_Index>, so just adding an offset to the address is not returning the expected value...
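
To illustrate that suspicion, here is a conceptual sketch of cyclic striping (this is not the emu_cxx_utils iterator implementation; the nodelet count is made up):

// Conceptual illustration only: a striped array spreads consecutive indices
// across nodelets, so the element logically at index i is NOT at
// (local_base + i) in any single nodelet's memory. An iterator over such an
// array has to encode this mapping; treating it like a raw pointer plus an
// offset lands on the wrong element.
#include <cstdint>
#include <cstdio>

constexpr int NODELETS = 8;   // hypothetical nodelet count

void locate(std::int64_t i, int &nodelet, std::int64_t &offset) {
    nodelet = static_cast<int>(i % NODELETS);   // which nodelet holds index i
    offset  = i / NODELETS;                     // position within that nodelet's block
}

int main() {
    int n;
    std::int64_t off;
    locate(9, n, off);   // index 9 -> nodelet 1, local offset 1
    std::printf("index 9 -> nodelet %d, offset %lld\n", n, static_cast<long long>(off));
    return 0;
}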

moonwatcher commented 1 year ago

Ok, this is now verified to happen in an example without LucataGraphBLAS and LAGraph. This is an emu_cxx_utils issue. See https://github.com/moonwatcher/striped_array_iterator_bug

skuntz commented 1 year ago

@tnisley2

tnisley2 commented 1 year ago

I'll look into this. I'm putting my money on emu::repl<GrB_Index> ts_ = 0; being the problem. That's where Matthew and I had problems before.

Here's the documentation for it:

Replicated object wrapper templates:

emu::repl<T>: Replicated primitive (int, long, MyClass*, etc.)

    - Reads will return the value of the local copy.
    - Writes will assign the value to all copies.
    - Must not be created on the stack. Use new, emu::make_repl<T>(), or compose within another replicated object.
    - If you want a replicated primitive that does not broadcast writes to all nodelets, just use a regular primitive in replicated storage (i.e., long instead of emu::repl<long>).
    - Replicated references (i.e. emu::repl<MyClass&>) are not supported yet. Use a pointer instead.
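
A conceptual mock of the read/write behavior those bullets describe (this is not the emu_cxx_utils implementation; the class name and nodelet count are made up for illustration):

// Mock of repl<T> semantics: reads return the local copy, writes broadcast
// to every copy.
#include <array>
#include <cstdio>

template <typename T, int NODELETS = 4>
struct mock_repl {
    std::array<T, NODELETS> copies{};   // one copy per "nodelet"
    int local = 0;                      // which copy this nodelet sees

    operator T() const { return copies[local]; }   // read: local copy only
    mock_repl &operator=(T v) {                    // write: assign to all copies
        for (auto &c : copies) c = v;
        return *this;
    }
};

int main() {
    mock_repl<long> x;
    x = 42;   // broadcast: every copy becomes 42
    std::printf("%ld %ld\n", static_cast<long>(x.copies[0]), static_cast<long>(x.copies[3]));
    return 0;
}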

mcordery commented 1 year ago

Getting rid of them in eMultM appears to resolve my translation error and alignment error problems. So it seems like I'm just going to go back to putting in my hack to do this sort of thing. @skuntz

tnisley2 commented 1 year ago

The issue is that the addition to the local variable in the reducer is not thread-safe. Changing reducer.h to use a remote operation for += solved the issue. I will have to update cxx_utils.

    void operator+=(T &rhs) { this->local_sum_ += rhs; }  // this is not thread-safe

    void operator+=(T &rhs) { remote_add( &(this->local_sum_), rhs); } // this is

I'll change the OP_ADD reducer, but eventually I will need to change the operations on all reducer types to be thread-safe.
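
For illustration, here is a self-contained reproduction of the kind of lost-update race being described, with std::atomic standing in for a remote add (a hypothetical example, not the reducer.h code):

// Many threads incrementing one shared counter: the plain += loses updates,
// the atomic add does not.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 8, kIters = 100000;

    long plain_sum = 0;                // unsynchronized += : updates can be lost
    std::atomic<long> atomic_sum{0};   // atomic add, analogous to a remote add

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < kIters; ++i) {
                plain_sum += 1;                                      // racy
                atomic_sum.fetch_add(1, std::memory_order_relaxed);  // safe
            }
        });
    }
    for (auto &w : workers) w.join();

    // plain_sum is typically less than kThreads * kIters; atomic_sum is exact.
    std::printf("plain=%ld atomic=%ld expected=%d\n",
                plain_sum, atomic_sum.load(), kThreads * kIters);
    return 0;
}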

tnisley2 commented 1 year ago

I talked with @skuntz and looked at this issue a bit more. The issue isn't that the reducer operations need to be remote. Each thread should have its own local reducer, so this is not necessary.

From reducer.h

* this reducer is meant to be copied by each thread and carried around in
 * registers. Upon copy, the new reducer holds a reference to the reducer it
 * was copied from. When the copy goes out of scope, it reduces into the
 * original reduction variable.

The problem is we're not using the reducer correctly.

Inside the outer loop, we create the reducer variable. When we get to the inner loop, the lambda is using a REFERENCE to the reducer variable. That means every thread spawned by the inner loop is trying to add to the same memory address.

Instead, we should create the variable ts within the outer loop. Then we create the reducer variable within the inner loop, using the reference to ts. This way each thread gets its own local copy. When the inner loop exits, this local variable goes out of scope and is reduced into ts. This is how the reducers should be used.

moonwatcher commented 1 year ago

@tnisley2 could you propose a pattern and update the code in https://github.com/moonwatcher/striped_array_iterator_bug ?

mcordery commented 1 year ago

If we want to move to an OpenMP model, we need to figure out how to get this to work so that the transition to OpenMP is easy. In that case you'd just have, say:

int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int j = 0; j < n; ++j)
{
    // ...work
    sum++;
    // ...work
}

When it comes to for_each, we'd just have to provide the emu_cxx_utils mechanism for doing it, a la RAJA:


RAJA::ReduceSum<RAJA::omp_reduce, double> ompdot(0.0);

RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
    ompdot += a[i] * b[i];
});

dot = ompdot.get();
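
For comparison, the C++17 parallel algorithms already hide the per-thread partial sums behind std::transform_reduce; a sketch of the same dot product (this is not something emu_cxx_utils provides today, and it assumes a toolchain where std::execution::par actually runs in parallel):

// Dot product via transform_reduce: the library manages the per-thread
// partial results, so no explicit reducer object appears in user code.
#include <execution>
#include <numeric>
#include <vector>

double dot(const std::vector<double> &a, const std::vector<double> &b) {
    return std::transform_reduce(std::execution::par,
                                 a.begin(), a.end(), b.begin(), 0.0);
}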

tnisley2 commented 1 year ago

Moving the creation of the reducer into the inner loop creates a reducer variable for every iteration of every thread. Doesn't seem like there's an easy way to fix this right now.

We either need to change how the reducers work or come up with a way to pass a reducer to each thread in the for_each.

tnisley2 commented 1 year ago

@mcordery

I see OpenMP has a dedicated construct for this. It's possible we need new and different functions to process different things in parallel other than for_each. Right now, for_each is a big hammer we swing at anything that needs to be processed in parallel.

moonwatcher commented 1 year ago

Sounds to me like we need a hierarchy of for_each variants. Trying to cram every scenario into one construct will end up producing (and probably already is) unmanageable code.

mogill commented 1 year ago

I think things that are different should look different. Our memory and migration model do not neatly fit into the so-called performance portable programming models like Sycl and Kokkos. Our solution must consider data structure placement as part of the thread model.

mcordery commented 1 year ago

I’m not saying it needs to look ‘exactly’ like it, but ease of use and expected behavior need to be considered. Having to reason about ‘oh, reducers only give you a value when the destructor is called’ is not something I want to deal with. We need to be as portable as possible; otherwise people won’t buy in when they could just say ‘we’ll just wait for NVIDIA to come up with something’.

tnisley2 commented 1 year ago

I think there's some confusion about the reduction happening when a reducer goes out of scope and the destructor is called. The idea is that there is one local reducer copy per thread in a loop. We need to pass the reducer variable into the for_each loop by value. This calls the copy constructor to create the local reducer for that thread. Once that thread finishes, its value is added to the global reducer variable. So once you exit the loop, all the values should have been reduced.

If that reducer is a replicated variable, then we also need to call repl_reduce to combine the reduced value on all nodes.

tnisley2 commented 1 year ago

Here's code for a solution that worked for me. This is how reducers are used in Beedrill:

emu::parallel::for_each(emu::static_policy<1, 256>(), M.begin(), M.end(), [&](GrB_Index i) {
    //emu::repl<GrB_Index> ts_ = 0;  // don't use repl<T>
    GrB_Index ts = 0; // row is located on a single node, no need to have a replicated variable
    //{  this extra scope is not needed. We reduce when the LOCAL reducer goes out of scope
        emu::reducer_opadd<GrB_Index> nsum(&ts); // declare the global reducer variable
        // Pass in the reducer by value. This calls the copy constructor, thus creating a local reducer for the thread in the loop
        emu::parallel::for_each(emu::static_policy<1, 4>(), M.begin(), M.begin() + i, [&, nsum](GrB_Index j) mutable { 
            nsum += sizes[j]; // add to the local reducer.
        }); // as each thread finishes, its local reducer is reduced into the global reducer
    //} // end of scope  - not needed
    tstart[i] = ts;        // after the cilk_sync, everything should be reduced into ts for this row
});

Edit: Lots of comments here, but I feel there is a lot of misunderstanding about how the reducers and replicated objects work.

mogill commented 1 year ago

I agree there is confusion about how things work, but some of this discussion is also about how things should work. There's also the notion that at different scales, different parallel reduction strategies make the most sense.

We're limited to Cilk-style for now, but I think Matt and I share the opinion this stuff is best handled by the compiler, as with OMP, which can implement a bunch of different strategies and select the best one at runtime based on data size, locality, amount of CPU resources available, etc.

The compiler can also implement this kind of scaffolding more consistently than humans, reducing opportunities for human error like the inner reduction of @tnisley2's example: nsum += sizes[j] is in a parallel region but is not an atomic update.

tnisley2 commented 1 year ago

@mogill The nsum += sizes[j] ultimately does not need to be atomic if we understand how it operates. There should be one local reducer per thread, so no data race occurs.

mogill commented 1 year ago

Coming from the perspective of how things might be done instead of how Cilk does things, this seems brittle to me: Users need to know how reducer_opadd is implemented to use it correctly, and the implementation can't be changed to one counter per node or core (which would almost certainly perform better) without breaking applications.

A user's expectations are generally architectural (i.e., a sum reduction occurs), but if we require them to know the implementation details (i.e., cyclic reduction with per-thread temp storage), we will violate user expectations and they'll end up building their own ad-hoc solutions.

tnisley2 commented 1 year ago

From what I understand, these reducers are just an extension of cilk_reducers, written because they worked at the time. Most of the confusion comes from the fact that documentation is limited and we are trying to use the source code to understand them.

moonwatcher commented 1 year ago

@mogill there will probably be additional programming models implemented in the near or far future, and these conversations do inform the rest of the team about how things work now, why this is broken, and what we expect from future implementations. But I think the point @tnisley2 is trying to make is that, as opaque and weird and difficult to understand as it is, this implementation is working. And we ARE at the phase of trying to optimize LucataGraphBLAS, and we need these constructs. If we wait until such future programming models materialize, we will not have a worthwhile implementation of GraphBLAS any time soon. So this conversation about how a future model should look needs to branch off into a different topic.

https://github.com/moonwatcher/striped_array_iterator_bug/blob/d53e2cf02bf1a25fe479bf0d365d978fcc88977a/CTest/test_striped_array_iterator.cc#L90-L110

Here is what I have so far in terms of a pattern. It works, but it exhibits several copy-constructor calls that could probably be eliminated. Those, however, are probably coming from the emu::for_each implementation, and getting rid of them should be considered an emu_cxx_utils optimization. I am going to extend this test to include more variations in the policies, but as far as @mcordery is concerned they will be "tested" (@jamesETsmith has already set us up a test framework on emu_cxx_utils, and this test will migrate there).

@mcordery can you point me to additional similar parallelized emu::for_each loops that you suspect might be misbehaving so we can add test cases?

jamesETsmith commented 1 year ago

@moonwatcher, what's the status on this? Can it be merged? (Got mixed up and thought this was a PR.) I think we can close this now, right @moonwatcher?

moonwatcher commented 1 year ago

Yeah, I think this was resolved.