Algebraic-Programming / ALP

Home of ALP/GraphBLAS and ALP/Pregel, featuring shared- and distributed-memory auto-parallelisation of linear algebraic and vertex-centric programs. Soon with more to come!

249 alp passes 0 as max calls to lpf collectives initialisation #302

Closed anyzelman closed 3 months ago

anyzelman commented 5 months ago

Closes #249 .

Closes #303 .

anyzelman commented 5 months ago

Crossref #303

anyzelman commented 5 months ago

Concept release notes while waiting for tests and reviews:

This MR fixes a bug where lpf_coll_t was not initialised properly. The bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in.

While fixing the bug, this MR also detaches the implementation of the public grb::collectives<> API for ALP's distributed-memory backends from the implementation of the internal collectives. It furthermore introduces a global collectives instance that is directly managed by the distributed-memory ALP backend, which avoids creating an lpf_coll_t instance every time a collective is called.
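For illustration, the new pattern boils down to something like the sketch below. The GlobalCollectivesContext name is hypothetical and the exact lpf_collectives_init signature is an assumption (consult lpf/collectives.h for the authoritative one); the point is that a single lpf_coll_t is created once per process and then shared by both the public and the internal collectives.

#include <cstddef>

#include <lpf/core.h>
#include <lpf/collectives.h>

// Hypothetical per-process collectives context. Both the grb::collectives<>
// implementation and the internal collectives would fetch the same handle
// from here instead of each creating their own lpf_coll_t.
struct GlobalCollectivesContext {

    lpf_coll_t coll;          // the single, reused collectives instance
    bool initialised = false;

    // to be called once from the backend's SPMD entry point; the argument
    // list of lpf_collectives_init is assumed here, not quoted
    lpf_err_t init(
        lpf_t ctx, const lpf_pid_t s, const lpf_pid_t p,
        const size_t max_calls_per_sync,   // the value that was erroneously 0
        const size_t max_elem_size, const size_t max_byte_size
    ) {
        const lpf_err_t rc = lpf_collectives_init(
            ctx, s, p,
            max_calls_per_sync,
            max_elem_size, max_byte_size,
            &coll
        );
        initialised = (rc == LPF_SUCCESS);
        return rc;
    }

    // every public or internal collective call reuses this handle
    lpf_coll_t & handle() { return coll; }

};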

As always, the MR includes both related and unrelated code style fixes.

anyzelman commented 5 months ago

Some issues with the smoke tests show that the previous pre/post-amble programming pattern for collective communication is wholly incompatible with driving the LPF collectives as their specification requires-- sadly to the point that #303 must be resolved now as well. Getting on with that now...

(At this point, though, all unit tests with LPF are OK.)

anyzelman commented 5 months ago

Handled #303 as well with the latest commits. Running the full test suite with LPF (including perftests) to make sure there are no snafus that snuck in for the BSP1D and hybrid backends.

anyzelman commented 5 months ago

There are snafus:

Process A:

(gdb) bt
#0  0x00007f52b45a72c0 in MPID_nem_tcp_connpoll () from /usr/lib64/mpich/lib/libmpi.so.12
#1  0x00007f52b4596212 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#2  0x00007f52b44859ef in PMPI_Sendrecv () from /usr/lib64/mpich/lib/libmpi.so.12
#3  0x00007f52b47783c6 in sparse_all_to_all () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#4  0x00007f52b475de74 in lpf::mpi::SparseAllToAll::exchange(lpf::mpi::Comm const&, bool, int*, int) ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#5  0x00007f52b475915b in lpf::MessageQueue::sync(bool) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#6  0x00007f52b4769996 in lpf::Interface::abort() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#7  0x00007f52b474ee4e in lpf::Process::hook(lpf::mpi::Comm const&, lpf::Process&, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) [clone .cold] () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#8  0x00007f52b4767377 in lpf::Process::exec(unsigned int, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#9  0x000000000040cd61 in grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::run_lpf<unsigned long, grb::RC, grb::internal::ExecDispatcher<unsigned long, grb::RC, (grb::EXEC_MODE)0, true, false> > (data_out=0x7ffe85d86c44, in_size=8, data_in=0x7ffe85d86c48, alp_program=<optimized out>, 
    this=<synthetic pointer>) at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:620
#10 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::pack_data_and_run<unsigned long, grb::RC, false> (broadcast=true, data_out=0x7ffe85d86c44, 
    in_size=8, data_in=0x7ffe85d86c48, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
    at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:702
#11 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::exec<unsigned long, grb::RC> (broadcast=true, data_out=@0x7ffe85d86c44: grb::SUCCESS, 
    data_in=@0x7ffe85d86c48: 100, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
    at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:762
#12 main (argc=<optimized out>, argv=0x7ffe85d86ee8) at /home/yzelman/Documents/graphblas/tests/unit/eWiseMatrix.cpp:210

Process B:

#0  0x00007fe2b5662219 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#1  0x00007fe2b555a133 in MPIR_Wait_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#2  0x00007fe2b55d18da in MPIC_Wait () from /usr/lib64/mpich/lib/libmpi.so.12
#3  0x00007fe2b55d1e43 in MPIC_Recv () from /usr/lib64/mpich/lib/libmpi.so.12
#4  0x00007fe2b558f1e3 in MPIR_Bcast_intra_binomial () from /usr/lib64/mpich/lib/libmpi.so.12
#5  0x00007fe2b54ea4df in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#6  0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#7  0x00007fe2b559079b in MPIR_Bcast_intra_smp () from /usr/lib64/mpich/lib/libmpi.so.12
#8  0x00007fe2b54ea4ff in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#9  0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#10 0x00007fe2b54eaf00 in PMPI_Bcast () from /usr/lib64/mpich/lib/libmpi.so.12
#11 0x00007fe2b582d672 in lpf::mpi::Comm::broadcast(void*, unsigned long, int) const ()
   from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#12 0x00007fe2b5833a29 in lpf::Process::slave() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#13 0x00007fe2b5833aa8 in lpf::Process::Process(lpf::mpi::Comm const&) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#14 0x00007fe2b58364f0 in lpf::Interface::initRoot(int*, char***) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#15 0x00007fe2b582205e in lpf::mpi_initializer(int, char**) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#16 0x00007fe2b586ef2a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fe2b586f031 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#18 0x00007fe2b586014a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#19 0x0000000000000001 in ?? ()
#20 0x00007ffff52d37ae in ?? ()
#21 0x0000000000000000 in ?? ()

Process B is still in the daemon (slave) mode that process A is supposed to resolve via its call to lpf_exec. The last traceable code block on the ALP side is bsp1d/exec.hpp, around line 620:

                                RC run_lpf(
                                        const lpf_func_t alp_program,
                                        const void * const data_in,
                                        const size_t in_size,
                                        U * const data_out
                                ) const {
                                        // construct LPF I/O args
                                        lpf_args_t args = {
                                                data_in, in_size,
                                                data_out, sizeof( U ),
                                                &alp_program, 1
                                        };

                                        // get LPF function pointer
                                        lpf_spmd_t fun = reinterpret_cast< lpf_spmd_t >(
                                                internal::alp_exec_dispatch< T, U, DispatcherType > );

                                        // execute
                                        const lpf_err_t spmdrc = init == LPF_INIT_NONE
                                                ? lpf_exec( LPF_ROOT, LPF_MAX_P, fun, args )
                                                : lpf_hook( init, fun, args );
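                                        // ...remainder of run_lpf omitted from this snippet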

while the trace confirms lpf_exec is indeed called.

anyzelman commented 5 months ago

Fixes for the first snafu reported above are pushed, with big thanks to the LPF debug layer. Since the fixed issue included memory corruption (put/get into unregistered memory), this may also fix the second snafu-- re-running the full test suite.
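For context, the failure mode amounts to communicating via a memory slot that was never (or not yet) registered. A minimal sketch of the intended LPF discipline follows-- the signatures and constants are quoted from memory of lpf/core.h and should be treated as assumptions rather than a reference:

#include <cassert>
#include <cstddef>

#include <lpf/core.h>

// Register first, sync so the (global) registration takes effect, and only
// then put-- the bug class above corresponds to skipping the registration
// and/or the activating sync.
void illustrative_spmd( lpf_t ctx, const lpf_pid_t s, const lpf_pid_t p, lpf_args_t ) {
    double payload = static_cast< double >( s );

    // make room for one memory slot and one outgoing message
    lpf_err_t rc = lpf_resize_memory_register( ctx, 1 );
    if( rc == LPF_SUCCESS ) { rc = lpf_resize_message_queue( ctx, 1 ); }
    if( rc == LPF_SUCCESS ) { rc = lpf_sync( ctx, LPF_SYNC_DEFAULT ); }

    // register the payload on all processes; as I recall the spec, global
    // registrations only become usable after the next sync
    lpf_memslot_t slot = LPF_INVALID_MEMSLOT;
    if( rc == LPF_SUCCESS ) {
        rc = lpf_register_global( ctx, &payload, sizeof( payload ), &slot );
    }
    if( rc == LPF_SUCCESS ) { rc = lpf_sync( ctx, LPF_SYNC_DEFAULT ); }

    // only now is it legal to communicate via the slot
    if( rc == LPF_SUCCESS && p > 1 && s == 0 ) {
        rc = lpf_put( ctx, slot, 0, 1, slot, 0, sizeof( payload ), LPF_MSG_DEFAULT );
    }
    if( rc == LPF_SUCCESS ) { rc = lpf_sync( ctx, LPF_SYNC_DEFAULT ); }
    assert( rc == LPF_SUCCESS );
    (void) rc;
}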

anyzelman commented 5 months ago

Also added @KADichev as a reviewer from an LPF perspective, which is what this MR is mostly about. Smoke tests found two more snafus that I'm gonna look at now:

@KADichev -- please hold off reviewing until the MR is flagged ready, I'll need to 1) fix the above snafu and 2) clean up the code.

anyzelman commented 5 months ago

Bugs fixed and re-running tests. I will simultaneously start cleaning up the code.

anyzelman commented 5 months ago

@alberto-scolari @KADichev this one is ready for review. Smoke tests with LPF pass, and all tests compile. I'm running the full test suite with LPF and with perftests on both x86 and ARM for extra checks. The internal CI is also testing this with LPF enabled. The expectation is that all this in-depth checking will be done by tomorrow.

anyzelman commented 5 months ago

Updating test progress: so far all ok.

anyzelman commented 5 months ago

For x86, perftests are a-OK up to (and including) Freescale1.mtx, which is already a pretty big one-- so it is relatively certain this MR is stable. This is for the hybrid backend at P=2, meaning it also passed bsp1d at P=1. The internal CI with LPF also passed.

I'll let the ARM one run ahead a bit more, while killing the x86 perftests in favour of resolving/testing other MRs.

anyzelman commented 5 months ago

In-depth testing all OK. @KADichev @alberto-scolari please review and approve

KADichev commented 4 months ago

I will try to review carefully next week. I am far behind on my day-to-day HiCR stuff.

alberto-scolari commented 4 months ago

As I am only a reviewer (and not a decision-maker) and my reviews are posted, I am approving this PR to let it advance.

anyzelman commented 4 months ago

> As I am only a reviewer (and not a decision-maker) and my reviews are posted, I am approving this PR to let it advance.

For next time it is totally OK to make it an RFC rather than approve; there are good comments there and I will take several.

It's also helpful / totally OK, if you have remarks that somewhat escape the context of the particular MR / issue, to open separate issues to address them-- or, if in doubt, to still post them as an RFC. For example, I already made such an issue for LPF (and one for reducing code duplication already existed there), while some others (like BSP1D_data) I'll indeed address here since they should ;) be straightforward enough.

Anyway, @KADichev please also check which of the comments you want to take note of-- I do recommend putting the useful ones in separate issues; trackers are super useful for not forgetting good ideas. I hope this MR can merge in the following days.

anyzelman commented 4 months ago

After merging in the latest develop, I found that not all LPF tests complete successfully. This was due to a bug regarding casting behaviour in the new allreduce implementation. It has been fixed by creating a generic allreduce/reduce implementation with corrected casting. However, fixing it revealed that sometimes a monoid identity is required for proper behaviour, which leads to a specification change. Before making that final in the base spec, I am first making sure the tests are OK again. Another TODO remains: applying Alberto's suggested fixes.
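To illustrate why an identity enters the picture, consider a generic (hypothetical, not ALP's actual code) reduction where the input and output domains differ:

#include <vector>

// Illustrative only-- not ALP's actual allreduce. When the local value type
// IN differs from the reduction type OUT, each contribution must be cast to
// OUT *before* reducing, and a process that contributes nothing must supply
// the monoid identity so the result stays well defined.
template< typename OUT, typename IN, typename Op >
OUT allreduce_with_cast(
    const std::vector< IN > &local, // this process's contributions
    const Op &op,                   // associative, commutative operator on OUT
    const OUT &identity             // the monoid identity under op
) {
    OUT accumulator = identity;     // without the identity, an empty local
                                    // contribution has no sound starting value
    for( const IN &x : local ) {
        accumulator = op( accumulator, static_cast< OUT >( x ) );
    }
    // ...the per-process accumulators would then be combined across all
    // processes (e.g., via the LPF collectives), again using op over OUT
    return accumulator;
}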

anyzelman commented 4 months ago

Quickly before flying out: the last commit is bugged again-- pinnedVector for BSP1D at P=2 deadlocks, while several other unit tests fail. The easiest to debug are likely set and capacity.

anyzelman commented 3 months ago

Updated concept release notes:

This MR fixes a bug where lpf_coll_t was not initialised properly. The bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in. This not-up-to-spec initialisation of the LPF collectives nevertheless happened repeatedly throughout the code base. This MR overhauls the implementation of the collectives entirely to address the issue.

Within ALP, there are two classes of collectives: those exposed via the public grb::collectives<> API, and raw internal collectives that are used in the implementation of, e.g., the BSP1D and hybrid backends. Prior to this MR the implementations of these collectives were disjoint; with this MR, both types are managed by a shared context (internally encapsulated by the BSP1D_Data object). This central management means every process maintains but a single lpf_coll_t instance, which is reused every time a public or internal collective is called.

The revised collectives also make a systematic choice: collectives on scalars are orchestrated through the global buffer, so as to avoid registering the scalar memory addresses every time a collective is called. This is likely to cut latency costs significantly compared to the status prior to this MR. For collectives on arrays, the source and/or destination arrays are registered, which incurs a latency cost-- but one that, assuming large enough arrays, is much lower than that of copying the payload to and from the global buffer.
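As a rough sketch of that trade-off (the buffer and member names here are hypothetical, and the lpf_register_global signature is an assumption):

#include <cstddef>
#include <cstring>

#include <lpf/core.h>

// Sketch of the scalar-vs-array choice described above; not ALP's code.
struct CollectivesBuffers {

    lpf_t ctx;
    lpf_memslot_t global_buffer_slot; // registered once, at backend init
    char * global_buffer;             // backs all scalar collectives
    size_t global_buffer_size;        // assumed >= sizeof( T ) below

    // scalar path: copy into the already-registered global buffer; no
    // per-call registration, hence minimal latency
    template< typename T >
    void stage_scalar( const T &value ) {
        std::memcpy( global_buffer, &value, sizeof( T ) );
        // ...the collective then communicates global_buffer_slot and the
        // result is copied back out of the global buffer
    }

    // array path: register the payload itself; registration costs latency,
    // but avoids copying large payloads through the global buffer
    lpf_err_t stage_array(
        void * const array, const size_t size, lpf_memslot_t &slot
    ) {
        return lpf_register_global( ctx, array, size, &slot );
        // ...the collective communicates slot directly; deregister afterwards
    }

};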

Features this MR introduces in realising the refactored collectives:

Other issues this MR addresses:

As always, the MR includes both related and unrelated code style fixes.

anyzelman commented 3 months ago

Hi @alberto-scolari @KADichev this one is ready, pending CIs and a manual test run.

@alberto-scolari your comments have been handled in line with the (resolved) discussions above.

If one of you could flag it as approved by tomorrow morning, we can merge it (if the tests are indeed all OK :)

anyzelman commented 3 months ago

All tests and CIs OK at this point

anyzelman commented 3 months ago

Hi @KADichev -- this MR has been merged and is closed, but I'll retain the branch so you can still refer to the above comments in case they are useful for the LPF collectives.