Closed anyzelman closed 3 months ago
Crossref #303
Concept release notes while waiting for tests and reviews:
This MR fixes a bug where lpf_coll_t was not initialised properly. This bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in.
While fixing the bug, this MR furthermore detaches the implementation of the public grb::collectives<> API for ALP's distributed-memory backends from the implementation of the internal collectives. It also introduces a global collectives instance that is directly managed by the distributed-memory ALP backend, thereby avoiding the creation of an lpf_coll_t instance every time a collective is called.
As always, the MR includes both related and unrelated code style fixes.
Some issues with smoke tests show that the previous pre/post-amble programming pattern for collective communication is wholly incompatible with driving the LPF collectives as their spec requires-- sadly, to the point that #303 must be resolved now. Getting on with that now...
(At this point though: all unit tests with LPF are OK)
Handled #303 as well with the latest commits. Running the full test suite with LPF (including perftests) to make sure no snafus snuck in for the BSP1D and hybrid backends.
There are snafus:
Process A:
(gdb) bt
#0 0x00007f52b45a72c0 in MPID_nem_tcp_connpoll () from /usr/lib64/mpich/lib/libmpi.so.12
#1 0x00007f52b4596212 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#2 0x00007f52b44859ef in PMPI_Sendrecv () from /usr/lib64/mpich/lib/libmpi.so.12
#3 0x00007f52b47783c6 in sparse_all_to_all () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#4 0x00007f52b475de74 in lpf::mpi::SparseAllToAll::exchange(lpf::mpi::Comm const&, bool, int*, int) ()
from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#5 0x00007f52b475915b in lpf::MessageQueue::sync(bool) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#6 0x00007f52b4769996 in lpf::Interface::abort() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#7 0x00007f52b474ee4e in lpf::Process::hook(lpf::mpi::Comm const&, lpf::Process&, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) [clone .cold] () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#8 0x00007f52b4767377 in lpf::Process::exec(unsigned int, void (*)(void*, unsigned int, unsigned int, lpf_args), lpf_args) ()
from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#9 0x000000000040cd61 in grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::run_lpf<unsigned long, grb::RC, grb::internal::ExecDispatcher<unsigned long, grb::RC, (grb::EXEC_MODE)0, true, false> > (data_out=0x7ffe85d86c44, in_size=8, data_in=0x7ffe85d86c48, alp_program=<optimized out>,
this=<synthetic pointer>) at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:620
#10 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::pack_data_and_run<unsigned long, grb::RC, false> (broadcast=true, data_out=0x7ffe85d86c44,
in_size=8, data_in=0x7ffe85d86c48, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:702
#11 grb::internal::BaseLpfLauncher<(grb::EXEC_MODE)0>::exec<unsigned long, grb::RC> (broadcast=true, data_out=@0x7ffe85d86c44: grb::SUCCESS,
data_in=@0x7ffe85d86c48: 100, alp_program=0x40ecc0 <grb_program(unsigned long const&, grb::RC&)>, this=<synthetic pointer>)
at /home/yzelman/Documents/graphblas/include/graphblas/bsp1d/exec.hpp:762
#12 main (argc=<optimized out>, argv=0x7ffe85d86ee8) at /home/yzelman/Documents/graphblas/tests/unit/eWiseMatrix.cpp:210
Process B:
#0 0x00007fe2b5662219 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpi.so.12
#1 0x00007fe2b555a133 in MPIR_Wait_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#2 0x00007fe2b55d18da in MPIC_Wait () from /usr/lib64/mpich/lib/libmpi.so.12
#3 0x00007fe2b55d1e43 in MPIC_Recv () from /usr/lib64/mpich/lib/libmpi.so.12
#4 0x00007fe2b558f1e3 in MPIR_Bcast_intra_binomial () from /usr/lib64/mpich/lib/libmpi.so.12
#5 0x00007fe2b54ea4df in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#6 0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#7 0x00007fe2b559079b in MPIR_Bcast_intra_smp () from /usr/lib64/mpich/lib/libmpi.so.12
#8 0x00007fe2b54ea4ff in MPIR_Bcast_intra_auto () from /usr/lib64/mpich/lib/libmpi.so.12
#9 0x00007fe2b54ea685 in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpi.so.12
#10 0x00007fe2b54eaf00 in PMPI_Bcast () from /usr/lib64/mpich/lib/libmpi.so.12
#11 0x00007fe2b582d672 in lpf::mpi::Comm::broadcast(void*, unsigned long, int) const ()
from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#12 0x00007fe2b5833a29 in lpf::Process::slave() () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#13 0x00007fe2b5833aa8 in lpf::Process::Process(lpf::mpi::Comm const&) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#14 0x00007fe2b58364f0 in lpf::Interface::initRoot(int*, char***) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#15 0x00007fe2b582205e in lpf::mpi_initializer(int, char**) () from /scratch/lpf-mpich-install/lib/liblpf_core_univ_mpimsg_Release.so
#16 0x00007fe2b586ef2a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fe2b586f031 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#18 0x00007fe2b586014a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#19 0x0000000000000001 in ?? ()
#20 0x00007ffff52d37ae in ?? ()
#21 0x0000000000000000 in ?? ()
Process B is still in the daemon mode that process A is supposed to resolve via its call to lpf_exec. The last traceable code block (ALP side) is bsp1d/exec.hpp, around line 620:
RC run_lpf(
	const lpf_func_t alp_program,
	const void * const data_in,
	const size_t in_size,
	U * const data_out
) const {
	// construct LPF I/O args
	lpf_args_t args = {
		data_in, in_size,
		data_out, sizeof( U ),
		&alp_program, 1
	};
	// get LPF function pointer
	lpf_spmd_t fun = reinterpret_cast< lpf_spmd_t >(
		internal::alp_exec_dispatch< T, U, DispatcherType > );
	// execute
	const lpf_err_t spmdrc = init == LPF_INIT_NONE
		? lpf_exec( LPF_ROOT, LPF_MAX_P, fun, args )
		: lpf_hook( init, fun, args );
while the trace confirms that lpf_exec is indeed called.
Fixes for the first snafu reported above are pushed, with big thanks to the LPF debug layer. Since the fixed issue included memory corruption (put/get into unregistered memory), this may also fix the second snafu-- re-running the full test suite.
Also added @KADichev as reviewer from an LPF perspective, which this MR is mostly about. Smoke tests find two more snafus that I'm gonna look at now:
@KADichev -- please hold off reviewing until the MR is flagged ready, I'll need to 1) fix the above snafu and 2) clean up the code.
Bugs fixed and re-running tests. I will simultaneously start cleaning up the code.
@alberto-scolari @KADichev this one is ready for review. Smoke tests with LPF pass, and all tests compile. I'm running the full test suite with LPF and with perftests on both x86 and ARM for extra checks. The internal CI is also testing this with LPF enabled. Expected is that tomorrow we have all this in-depth checking done.
Updating test progress: so far all ok.
For x86, perftests are a-OK up to (and including) Freescale1.mtx, which is already a pretty big one-- so it's relatively certain this MR is stable. This is for the hybrid backend at P=2, meaning it also passed bsp1d at P=1. Internal CI with LPF also passed.
I'll let the ARM one run ahead a bit more, while killing the x86 perftests in favour of resolving/testing other MRs.
In-depth testing all OK. @KADichev @alberto-scolari please review and approve
I will try to review carefully next week. I am far behind on my day-to-day HiCR stuff.
as I am only a reviewer (and not a decision-maker) and my reviews are posted, I am approving this PR to let it advance
For next time, it is totally OK to make it an RFC rather than approve-- there are good comments there and I will take several.
It's also helpful (and entirely OK), if you have remarks that escape the context of the particular MR/issue, to open other issues to address them-- or, if in doubt, to still post them as an RFC. For example, I already made such an issue for LPF (and one for reducing code duplication already existed there), while some others (like the BSP1D_data ones) I'll indeed address here since they should;) be straightforward enough.
Anyway, @KADichev please also check which of the comments you want to take note of-- I recommend putting the useful ones in separate issues, since trackers are super useful for not forgetting good ideas. I hope this MR may merge in the following days.
After merging in the latest develop, I found that not all LPF tests complete successfully. This was due to a bug regarding casting behaviour in the new allreduce implementation. It has been fixed by creating a generic allreduce/reduce implementation with corrected casting. However, fixing it revealed that sometimes a monoid identity is required for proper behaviour, which leads to a specification change. Before making that final in the base spec, I am now first making sure the tests are OK again. Another TODO remains: applying Alberto's suggested fixes.
Quickly, before flying out: the last commit is bugged again-- pinnedVector for BSP1D at P=2 deadlocks, while several other unit tests fail. The easiest to debug are likely set and capacity.
Updated concept release notes:
This MR fixes a bug where lpf_coll_t was not initialised properly. This bug never resulted in a noticeable issue, since the present implementation of the LPF collectives ignores the erroneous argument that ALP was passing in. The not-up-to-spec initialisation of the LPF collectives nevertheless occurred repeatedly throughout the code base. This MR overhauls the implementation of the collectives entirely to address the issue.
Within ALP, there are two classes of collectives: those exposed via the public grb::collectives<> API, and raw internal collectives that are used in the implementation of e.g. the BSP1D and hybrid backends. Prior to this MR, the implementations of these collectives were disjoint; with this MR, both types of collectives are managed by a shared context (internally encapsulated by the BSP1D_Data object). This central management means every process maintains but a single lpf_coll_t instance, which is reused every time a public or internal collective is called.
The revised collectives furthermore make a systematic choice: collectives on scalars are orchestrated through the global buffer, so as to avoid registering the scalar memory addresses every time a collective is called. This is likely to cut latency costs significantly compared to the status quo prior to this MR. For collectives on arrays, the source and/or destination arrays are registered, which incurs a latency cost-- but one that, assuming large enough array sizes, is much lower than the cost of copying the payload to and from the global buffer.
Features this MR introduces in realising the refactored collectives:
Other issues this MR addresses:
- error conditions in grb::setElement were not handled correctly -- these would trigger only in out-of-memory conditions, or in erroneous use of the primitive, so the impact of the bug is expected to be minor;
- the grb::setElement and grb::select base specifications did not list all error conditions, or did not do so correctly-- herewith fixed;
- grb::select had, in _DEBUG mode, an attempt to print an iterator to stdout instead of first dereferencing it-- herewith fixed.

As always, the MR includes both related and unrelated code style fixes.
Hi @alberto-scolari @KADichev this one is ready, pending CIs and a manual test run.
@alberto-scolari your comments have been handled in line with the (resolved) discussions above.
If one of you could flag it as approved by tomorrow morning, we can merge it (if tests are indeed all OK:)
All tests and CIs OK at this point
Hi @KADichev -- this MR has been merged and is closed, but I'll retain the branch so you can still refer to the above comments in case they are useful for the LPF collectives.
Closes #249 .
Closes #303 .