StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: Seg fault registering layout constraint #1641

Status: closed by syamajala 8 months ago

syamajala commented 8 months ago

I'm seeing a seg fault when registering layout constraints while the runtime is starting up. I have not changed the layout constraints, so I'm not sure why it's seg faulting. Here is a stack trace:

#0  0x000015554b6152a6 in Legion::OffsetConstraint::operator= (this=0xbd833529e37c66ed) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/legion_constraint.h:632
#1  0x000015554b615428 in std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<Legion::OffsetConstraint*, Legion::OffsetConstraint*> (__first=0x1555547add26 <std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_Rb_tree_impl<std::less<unsigned int>, true>::_Rb_tree_impl()+52>, __last=0x34363ff274417fe, __result=0xbd833529e37c66ed) at /cm/local/apps/gcc/9.2.0/include/c++/9.2.0/bits/stl_algobase.h:342
#2  0x000015554b60e0d3 in std::__copy_move_a<false, Legion::OffsetConstraint*, Legion::OffsetConstraint*> (__first=0x1555547add26 <std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_Rb_tree_impl<std::less<unsigned int>, true>::_Rb_tree_impl()+52>, __last=0x34363ff274417fe, __result=0xbd833529e37c66ed) at /cm/local/apps/gcc/9.2.0/include/c++/9.2.0/bits/stl_algobase.h:404
#3  0x000015554b603680 in std::__copy_move_a2<false, Legion::OffsetConstraint*, Legion::OffsetConstraint*> (__first=0x1555547add26 <std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_Rb_tree_impl<std::less<unsigned int>, true>::_Rb_tree_impl()+52>, __last=0x34363ff274417fe, __result=0xbd833529e37c66ed) at /cm/local/apps/gcc/9.2.0/include/c++/9.2.0/bits/stl_algobase.h:440
#4  0x000015554b5f470c in std::copy<Legion::OffsetConstraint*, Legion::OffsetConstraint*> (__first=0x1555547add26 <std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_Rb_tree_impl<std::less<unsigned int>, true>::_Rb_tree_impl()+52>, __last=0x34363ff274417fe, __result=0xbd833529e37c66ed) at /cm/local/apps/gcc/9.2.0/include/c++/9.2.0/bits/stl_algobase.h:474
#5  0x000015554b5e1eb3 in std::vector<Legion::OffsetConstraint, std::allocator<Legion::OffsetConstraint> >::operator= (this=0x4414880, __x=...) at /cm/local/apps/gcc/9.2.0/include/c++/9.2.0/bits/vector.tcc:243
#6  0x000015554b5d25dd in Legion::LayoutConstraintSet::operator= (this=0x4414720) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/legion_constraint.h:746
#7  0x000015554bec6d51 in Legion::LayoutConstraintRegistrar::operator= (this=0x4414718) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion.h:2553
#8  0x000015554beb54e4 in Legion::Internal::Runtime::preregister_layout (registrar=..., layout_id=1048577) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:30112
#9  0x000015554ba0821b in Legion::Runtime::preregister_layout (registrar=..., layout_id=4294967295) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/legion.cc:8202
#10 0x00001555547a960e in S3DRank::get_fortran_soa_layout () at s3d_rank_mpi.cc:657
#11 0x00001555547acfac in RegisterCPUVariant<InitTemperatureTask, S3DTask<InitTemperatureTask, 3>, true>::register_variant () at s3d_task.h:810
#12 0x00001555547abdd5 in S3DTask<InitTemperatureTask, 3>::register_variants () at s3d_task.h:991
#13 0x00001555547a74ed in S3DRank::start_legion (this=0x4413c30) at s3d_rank_mpi.cc:143
#14 0x00001555547a651d in initialize_rhsf_legion_ (mechanism_name=0x645618 "C7H16_52species", local_grid=0x7fffffff9e24, global_grid=0x7fffffff9e30, proc_grid=0x7fffffff9e3c, proc_id=0x7fffffff9e48, vary_in_dims=0x7fffffff9e54, p_nvar_tot=0x888f7c <__param_m_MOD_nvar_tot>, p_n_spec=0x888f9c <__param_m_MOD_n_spec>, p_n_reg=0x888fa0 <__param_m_MOD_n_reg>, p_n_scalar=0x6455e4, p_iorder=0x888fb0 <__param_m_MOD_iorder>, p_iforder=0x888fb8 <__param_m_MOD_iforder>, p_n_steps=0x88cb28 <__runtime_m_MOD_i_time_end>, p_n_stages=0x4091620 <__rk_m_MOD_nstage>, p_lagging_switch=0x4091890 <__transport_m_MOD_lagging_switch>, p_lag_steps=0x4091894 <__transport_m_MOD_lag_steps>, p_npts=0x6455f8, p_tempmin=0x6455f0, p_tempmax=0x6455e8, cpCoeff_aa=0x155511dd4010, cpCoeff_bb=0x155511d08010, enthCoeff_aa=0x155511f6c010, enthCoeff_bb=0x155511ea0010, p_i_react=0x88cbe8 <__thermchem_m_MOD_i_react>, p_i_time_save=0x88cb1c <__runtime_m_MOD_i_time_save>, p_i_time_mon=0x88cb24 <__runtime_m_MOD_i_time_mon>, p_i_time_res=0x88cb20 <__runtime_m_MOD_i_time_res>, p_i_time_tec=0x88cb18 <__runtime_m_MOD_i_time_tec>, p_i_time_fil=0x88a9f0 <__filter_m_MOD_i_time_fil>, p_unif_grid_dims=0x7fffffff9e60, scale_1x=0x4236e30, scale_1y=0x41b93c0, scale_1z=0x41ebba0, reference_values=0x7fffffff9f40, periodic_flags=0x7fffffff9e70, bc_types=0x7fffffff9ea0, x_min=0x7fffffff9ec0, x_max=0x7fffffff9ee0, relax_ct=0x8893e0 <__bc_m_MOD_relax_ct>) at rhst_fortran.cc:130
#15 0x00000000005d3d49 in solve_driver (io=6) at /lustre/scratch/vsyamaj/legion_s3d_subranks/s3d/source/drivers/solve_driver.f90:194
#16 0x00000000005d36a2 in s3d () at /lustre/scratch/vsyamaj/legion_s3d_subranks/s3d/source/drivers/main.f90:131
#17 0x0000000000404b8d in main (argc=<optimized out>, argv=<optimized out>) at /lustre/scratch/vsyamaj/legion_s3d_subranks/s3d/source/drivers/main.f90:8
#18 0x00001555527746a3 in __libc_start_main () from /lib64/libc.so.6
#19 0x0000000000404bce in _start () at /lustre/scratch/vsyamaj/legion_s3d_subranks/s3d/source/drivers/main.f90:8

lightsighter commented 8 months ago

This looks like memory corruption to me. This happens before we've even started Legion, while we're just trying to move an STL vector from one place to another. STL failures like this are almost always memory corruption. When they happen this early in the program, though, they are usually not hard to find.

lightsighter commented 8 months ago

FWIW, LayoutConstraintSet::operator= isn't even code we've written; it is a compiler-generated operator implementation.

syamajala commented 8 months ago

I tried running with valgrind and I'm seeing stuff like this:

==1908013== Conditional jump or move depends on uninitialised value(s)
==1908013==    at 0xD9DDA5E: std::vector<Legion::OffsetConstraint, std::allocator<Legion::OffsetConstraint> >::operator=(std::vector<Legion::OffsetConstraint, std::allocator<Legion::OffsetConstraint> > const&) (vector.tcc:224)
==1908013==    by 0xD9CF8E6: Legion::LayoutConstraintSet::operator=(Legion::LayoutConstraintSet const&) (legion_constraint.h:746)
==1908013==    by 0xE2982B4: Legion::LayoutConstraintRegistrar::operator=(Legion::LayoutConstraintRegistrar const&) (legion.h:2553)
==1908013==    by 0xE286AD5: Legion::Internal::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (runtime.cc:30112)
==1908013==    by 0xDDDFA34: Legion::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (legion.cc:8202)
==1908013==    by 0x585B977: S3DRank::get_fortran_soa_layout() (s3d_rank_mpi.cc:672)
==1908013==    by 0x585F369: RegisterCPUVariant<InitTemperatureTask, S3DTask<InitTemperatureTask, 3>, true>::register_variant() (s3d_task.h:808)
==1908013==    by 0x585E13E: S3DTask<InitTemperatureTask, 3>::register_variants() (s3d_task.h:989)
==1908013==    by 0x585984C: S3DRank::start_legion() (s3d_rank_mpi.cc:156)
==1908013==    by 0x585887C: initialize_rhsf_legion_ (rhst_fortran.cc:130)
==1908013==    by 0x5B3258: solve_driver_ (solve_driver.f90:194)
==1908013==    by 0x5B2BB1: MAIN__ (main.f90:131)
==1908013==  Uninitialised value was created by a heap allocation
==1908013==    at 0x4C9FD8B: operator new(unsigned long) (vg_replace_malloc.c:417)
==1908013==    by 0xE37993C: __gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::allocate(unsigned long, void const*) (new_allocator.h:114)
==1908013==    by 0xE36B191: std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > > >::allocate(std::allocator<std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >&, unsigned long) (alloc_traits.h:444)
==1908013==    by 0xE350C34: std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_get_node() (stl_tree.h:580)
==1908013==    by 0xE327ED6: std::_Rb_tree_node<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >* std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_create_node<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&) (stl_tree.h:630)
==1908013==    by 0xE2E39C1: std::_Rb_tree_iterator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > std::_Rb_tree<unsigned long, std::pair<unsigned long const, Legion::LayoutConstraintRegistrar>, std::_Select1st<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> >, std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&) (stl_tree.h:2455)
==1908013==    by 0xE2B9A26: std::map<unsigned long, Legion::LayoutConstraintRegistrar, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Legion::LayoutConstraintRegistrar> > >::operator[](unsigned long const&) (stl_map.h:499)
==1908013==    by 0xE286AC0: Legion::Internal::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (runtime.cc:30112)
==1908013==    by 0xDDDFA34: Legion::Runtime::preregister_layout(Legion::LayoutConstraintRegistrar const&, unsigned long) (legion.cc:8202)
==1908013==    by 0x585B977: S3DRank::get_fortran_soa_layout() (s3d_rank_mpi.cc:672)
==1908013==    by 0x585F369: RegisterCPUVariant<InitTemperatureTask, S3DTask<InitTemperatureTask, 3>, true>::register_variant() (s3d_task.h:808)
==1908013==    by 0x585E13E: S3DTask<InitTemperatureTask, 3>::register_variants() (s3d_task.h:989)

There is a full log here: http://sapling2.stanford.edu/~seshu/s3d_stencil/valgrind.txt

elliottslaughter commented 8 months ago

I've been staring at this, trying to figure out how it's happening, and so far don't have anything.

Valgrind and the segfault are in agreement, so at the moment we have no reason to distrust what it's telling us. According to valgrind, this is the first error we hit, so there is no memory corruption prior to this point. I would also add that this is so early in the program that nothing in Regent has been initialized yet. We're all in C++ code at this point.

The line that valgrind reports agrees with what we saw in the crash:

https://github.com/StanfordLegion/legion/blob/a790370b366a86be395db13d78dc18364f4b8e98/runtime/legion/runtime.cc#L30112

Looks about as straightforward as it gets. operator[] allocates the map entry (running a default constructor) and operator= assigns to it.

The pending_constraint_table comes from a local static variable in this method:

https://github.com/StanfordLegion/legion/blob/a790370b366a86be395db13d78dc18364f4b8e98/runtime/legion/runtime.cc#L32029-L32037

Again, hard to see anything going wrong there.

So the only thing I can figure is the constructor is somehow bad... but if it is, I don't see it. The LayoutConstraintRegistrar here should implicitly default construct the layout_constraint field:

https://github.com/StanfordLegion/legion/blob/a790370b366a86be395db13d78dc18364f4b8e98/runtime/legion/legion.cc#L2107-L2112

And similarly LayoutConstraintSet has a constructor here that should implicitly initialize all fields:

https://github.com/StanfordLegion/legion/blob/a790370b366a86be395db13d78dc18364f4b8e98/runtime/legion/legion_constraint.h#L748

Just to be sure I wasn't getting the C++ semantics confused, I went and looked them up:

"Default-initialization is performed in three situations: ... 3) when a base class or a non-static data member is not mentioned in a constructor initializer list and that constructor is called."

https://en.cppreference.com/w/cpp/language/default_initialization

So yeah, that should be fine.

The application code in question is also so simple that I can't see any way for it to be wrong, but I'll post it here in case @lightsighter sees something:

/*static*/ LayoutConstraintID S3DRank::get_fortran_soa_layout(void)
{
  static LayoutConstraintID layout_id = 0;
  if (layout_id > 0)
    return layout_id;
  // We haven't made the constraint set before so do it now
  LayoutConstraintRegistrar constraints;
  // This should be a normal instance
  constraints.add_constraint(SpecializedConstraint(NORMAL_SPECIALIZE));
  // Want Fortran ordering of dimensions
  std::vector<DimensionKind> dim_order(4);
  dim_order[0] = DIM_X;
  dim_order[1] = DIM_Y;
  dim_order[2] = DIM_Z;
  dim_order[3] = DIM_F; // SOA: fields are least quickly changing
  constraints.add_constraint(OrderingConstraint(dim_order, true/*contiguous*/));
  layout_id = Runtime::preregister_layout(constraints);
  return layout_id;
}

So I'm thoroughly stumped at this point.

syamajala commented 8 months ago

I guess MAX_DIM is set in 2 places in S3D and I only updated one of them.