syamajala closed this 8 months ago
This looks like memory corruption to me. This is before we've even started Legion, and we're trying to copy an STL vector from one place to another. STL failures like this are almost always memory corruption. When they happen this early in the program, though, they are usually not hard to find.
FWIW, LayoutConstraintSet::operator= isn't even code we've written; it is a compiler-generated operator implementation.
I tried running with valgrind and I'm seeing stuff like this:
==1908013== Conditional jump or move depends on uninitialised value(s)
==1908013== at 0xD9DDA5E: std::vector<Legion::OffsetConstraint, std::allocator<Legion::OffsetConstraint> >::operator=(std::vector<Legion::OffsetConstraint, std::allocator<Legion::OffsetConstraint> > const&) (vector.tcc:224)
==1908013== by 0xD9CF8E6: Legion::LayoutConstraintSet::operator=(Legion::LayoutConstraintSet const&) (legion_constraint.h:746)
==1908013== by 0xE2982B4: Legion::LayoutConstraintRegistrar::operator=(Legion::LayoutConstraintRegistrar const&) (legion.h:2553)
==1908013== by 0xE286AD5: Legion::Internal::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (runtime.cc:30112)
==1908013== by 0xDDDFA34: Legion::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (legion.cc:8202)
==1908013== by 0x585B977: S3DRank::get_fortran_soa_layout() (s3d_rank_mpi.cc:672)
==1908013== by 0x585F369: RegisterCPUVariant<InitTemperatureTask, S3DTask<InitTemperatureTask, 3>, true>::register_variant() (s3d_task.h:808)
==1908013== by 0x585E13E: S3DTask<InitTemperatureTask, 3>::register_variants() (s3d_task.h:989)
==1908013== by 0x585984C: S3DRank::start_legion() (s3d_rank_mpi.cc:156)
==1908013== by 0x585887C: initialize_rhsf_legion_ (rhst_fortran.cc:130)
==1908013== by 0x5B3258: solve_driver_ (solve_driver.f90:194)
==1908013== by 0x5B2BB1: MAIN__ (main.f90:131)
==1908013== Uninitialised value was created by a heap allocation
==1908013== at 0x4C9FD8B: operator new(unsigned long) (vg_replace_malloc.c:417)
==1908013== by 0xE37993C: __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::allocate(unsigned long, void const*) (new_allocator.h:114)
==1908013== by 0xE36B191: std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > > >::allocate(std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >&, unsigned long) (alloc_traits.h:444)
==1908013== by 0xE350C34: std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_get_node() (stl_tree.h:580)
==1908013== by 0xE327ED6: std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >* std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_create_node<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&) (stl_tree.h:630)
==1908013== by 0xE2E39C1: std::_Rb_tree_iterator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&) (stl_tree.h:2455)
==1908013== by 0xE2B9A26: std::map<unsigned long, Legion::LayoutConstraintRegistrar, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::operator[](unsigned long const&) (stl_map.h:499)
==1908013== by 0xE286AC0: Legion::Internal::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (runtime.cc:30112)
==1908013== by 0xDDDFA34: Legion::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (legion.cc:8202)
==1908013== by 0x585B977: S3DRank::get_fortran_soa_layout() (s3d_rank_mpi.cc:672)
==1908013== by 0x585F369: RegisterCPUVariant<InitTemperatureTask, S3DTask<InitTemperatureTask, 3>, true>::register_variant() (s3d_task.h:808)
==1908013== by 0x585E13E: S3DTask<InitTemperatureTask, 3>::register_variants() (s3d_task.h:989)
There is a full log here: http://sapling2.stanford.edu/~seshu/s3d_stencil/valgrind.txt
I've been staring at this, trying to figure out how it's happening, and so far don't have anything.
Valgrind and the segfault are in agreement, so at the moment we have no reason to distrust what it's telling us. According to valgrind, this is the first error we hit, so there is no memory corruption prior to this point. I would also add that this is so early in the program that nothing in Regent has been initialized yet. We're all in C++ code at this point.
The line that valgrind reports agrees with what we saw in the crash:
Looks about as straightforward as it gets. operator[] allocates the map entry (running a default constructor) and operator= assigns to it.
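To make the two steps concrete, here is a minimal sketch of the same pattern. Registrar and store_pending are hypothetical stand-ins, not Legion's actual types or code; the only thing taken from the trace is the operator[] followed by operator= sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for LayoutConstraintRegistrar (not the Legion type).
struct Registrar {
  std::vector<int> offsets;  // default-constructed to an empty vector
};

// Mirrors the two-step pattern: operator[] inserts a value-initialized
// entry, then the compiler-generated copy assignment overwrites it.
inline std::size_t store_pending(std::map<unsigned long, Registrar> &table,
                                 unsigned long id, const Registrar &src) {
  Registrar &slot = table[id];   // step 1: node allocated, Registrar() runs
  assert(slot.offsets.empty());  // a correctly constructed slot starts empty
  slot = src;                    // step 2: member-wise operator=
  return slot.offsets.size();
}
```

If both sides agree on what a Registrar looks like, the assert in step 1 can never fire, which is why an uninitialised read inside the vector assignment is so surprising.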
The pending_constraint_table comes from a local static variable in this method:
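For readers unfamiliar with the idiom, a function-local static table generally looks something like the sketch below. The accessor name and the value type are placeholders chosen for illustration; this is not the Legion source.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical accessor; name and value type are placeholders.
inline std::map<unsigned long, std::string> &get_pending_table() {
  // Constructed the first time control passes through this line, so it is
  // ready even when called before the runtime has started.
  static std::map<unsigned long, std::string> table;
  return table;
}
```

Every call returns a reference to the same map, so entries added during early registration are still there when the runtime later reads them.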
Again, hard to see anything going wrong there.
So the only thing I can figure is the constructor is somehow bad... but if it is, I don't see it. The LayoutConstraintRegistrar here should implicitly default construct the layout_constraint field:
And similarly LayoutConstraintSet has a constructor here that should implicitly initialize all fields:
Just to be sure I wasn't getting the C++ semantics confused, I went and looked them up:
Default-initialization is performed in three situations: ... 3) when a base class or a non-static data member is not mentioned in a constructor initializer list and that constructor is called.
https://en.cppreference.com/w/cpp/language/default_initialization
So yeah, that should be fine.
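A small sketch of the rule quoted above, using hypothetical types (ConstraintSet and RegistrarLike are illustrative names, not Legion's classes): a class-type member left out of the initializer list still has its own default constructor run.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct ConstraintSet {
  ConstraintSet() {}              // no member initializer list at all
  std::vector<int> offsets;       // class type: its default constructor runs
  std::string name;               // likewise, becomes ""
};

struct RegistrarLike {
  RegistrarLike() : handle(0) {}  // 'set' is not mentioned here
  int handle;
  ConstraintSet set;              // still default-constructed
};

// Returns true when every omitted class-type member came out empty.
inline bool members_are_empty() {
  RegistrarLike r;
  return r.set.offsets.empty() && r.set.name.empty() && r.handle == 0;
}
```

The caveat is that this only covers class-type members. A scalar member such as an int that is left out of the initializer list would have an indeterminate value, which is why handle is initialized explicitly here.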
The application code in question is also so simple that I can't see any way for it to be wrong, but I'll post it here in case @lightsighter sees something:
/*static*/ LayoutConstraintID S3DRank::get_fortran_soa_layout(void)
{
  static LayoutConstraintID layout_id = 0;
  if (layout_id > 0)
    return layout_id;
  // We haven't made the constraint set before so do it now
  LayoutConstraintRegistrar constraints;
  // This should be a normal instance
  constraints.add_constraint(SpecializedConstraint(NORMAL_SPECIALIZE));
  // Want fortran ordering of dimensions
  std::vector<DimensionKind> dim_order(4);
  dim_order[0] = DIM_X;
  dim_order[1] = DIM_Y;
  dim_order[2] = DIM_Z;
  dim_order[3] = DIM_F; // SOA: fields are least quickly changing
  constraints.add_constraint(OrderingConstraint(dim_order, true/*contiguous*/));
  layout_id = Runtime::preregister_layout(constraints);
  return layout_id;
}
So I'm thoroughly stumped at this point.
I guess MAX_DIM is set in 2 places in S3D and I only updated one of them.
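The thread does not spell out the mechanism, but one plausible reading is that application code and the runtime library compiled with different MAX_DIM values disagree on how some object is laid out. The sketch below simulates that with a template parameter; PointLike and ConstraintLike are hypothetical types, and whether any type reachable from LayoutConstraintRegistrar actually depends on MAX_DIM is an assumption here. In a real build each translation unit sees only one value of the macro, so the disagreement is invisible to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type whose size depends on a compile-time dimension limit.
template <int MAX_DIM>
struct PointLike {
  long long coords[MAX_DIM];
};

// Hypothetical aggregate: members after 'point' shift when MAX_DIM changes.
template <int MAX_DIM>
struct ConstraintLike {
  PointLike<MAX_DIM> point;
  std::vector<int> offsets;
};

// Size of the aggregate as seen by code compiled with a given MAX_DIM.
template <int MAX_DIM>
constexpr std::size_t constraint_size() {
  return sizeof(ConstraintLike<MAX_DIM>);
}

// Two "builds" of the same struct do not agree on its size.
static_assert(constraint_size<3>() != constraint_size<4>(),
              "layout depends on MAX_DIM");
```

If one side constructs the object with the smaller layout and the other side copies it with the larger one, the copy reads bytes the constructor never touched. That would be consistent with valgrind flagging an uninitialised value inside a freshly allocated map node, although the sketch does not prove that is what happened here.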
I'm seeing a seg fault registering layout constraints when the runtime is starting up. I have not changed the layout constraints, so I'm not sure why it's seg faulting. Here is a stack trace: