GEOS-DEV / GEOS

GEOS Simulation Framework

VTK mesh redistribution with default partitionRefinement (segmentation fault) #2821

Open castelletto1 opened 11 months ago

castelletto1 commented 11 months ago

Describe the bug Running GEOS in serial providing the mesh as VTK produces segmentation fault if the number of partitioning refinement iterations (partitionRefinement) is not 0 (default value 1).

To Reproduce
See the incompressible single-phase flow example below (the XML input, followed by the contents of mesh.vtk).

@francoishamon, @klevzoff, @untereiner : have you observed this behavior before?

<?xml version="1.0" ?>

<Problem>
  <Solvers
    gravityVector="{ 0.0, 0.0, 0.0 }">
    <SinglePhaseFVM
      name="SinglePhaseFlow"
      logLevel="1"
      discretization="singlePhaseTPFA"
      targetRegions="{ Domain }">
      <NonlinearSolverParameters
        newtonTol="1.0e-6"
        newtonMaxIter="8"/>
      <LinearSolverParameters
        directParallel="0"/>
    </SinglePhaseFVM>
  </Solvers>

  <Mesh>
    <VTKMesh
      name="mesh"
      partitionRefinement="1"
      file="mesh.vtk"/>
  </Mesh>

  <Events
    maxTime="1.0">
    <PeriodicEvent
      name="outputs"
      timeFrequency="1.0"
      target="/Outputs/vtkOutput"/>

    <PeriodicEvent
      name="solverApplications"
      forceDt="1.0"
      target="/Solvers/SinglePhaseFlow"/>

  </Events>

  <NumericalMethods>
    <FiniteVolume>
      <TwoPointFluxApproximation
        name="singlePhaseTPFA"/>
    </FiniteVolume>
  </NumericalMethods>

  <ElementRegions>
    <CellElementRegion
      name="Domain"
      cellBlocks="{ 0_hexahedra, 1_hexahedra, 2_hexahedra }"
      materialList="{ water, rock }"/>
  </ElementRegions>

  <Constitutive>
    <CompressibleSinglePhaseFluid
      name="water"
      defaultDensity="1000"
      defaultViscosity="0.001"
      referencePressure="0.0"
      compressibility="0.0"
      viscosibility="0.0"/>

    <CompressibleSolidConstantPermeability
      name="rock"
      solidModelName="nullSolid"
      porosityModelName="rockPorosity"
      permeabilityModelName="rockPerm"/>

    <NullModel
      name="nullSolid"/>

    <PressurePorosity
      name="rockPorosity"
      defaultReferencePorosity="0.05"
      referencePressure="0.0"
      compressibility="0.0"/>

    <ConstantPermeability
      name="rockPerm"
      permeabilityComponents="{ 1.0e-15, 1.0e-15, 1.0e-15 }"/>
  </Constitutive>

  <FieldSpecifications>

   <FieldSpecification
      name="sourceTerm"
      objectPath="ElementRegions/Domain/0_hexahedra"
      fieldName="pressure"
      scale="5e6"
      setNames="{ all }"/>

   <FieldSpecification
      name="sinkTerm"
      objectPath="ElementRegions/Domain/2_hexahedra"
      fieldName="pressure"
      scale="-5e6"
      setNames="{ all }"/>
  </FieldSpecifications>

  <Outputs>
    <VTK
      name="vtkOutput"/>
  </Outputs>
</Problem>

mesh.vtk:

# vtk DataFile Version 5.0
vtk domain
ASCII
DATASET STRUCTURED_POINTS

FIELD FieldData 1
CellLabels 2 3 string
0
injector
1
domain
2 
producer

DIMENSIONS 6 6 2
ORIGIN 0 0 0
SPACING 1 1 1

CELL_DATA 25
SCALARS attribute int 1
LOOKUP_TABLE default
0 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 2
klevzoff commented 11 months ago

According to the ParMETIS docs:

The routines must be called by at least two processors. That is, ParMETIS cannot be used on a single processor.

We should have an early MPI size check and exit in any case, to avoid doing extra work (e.g. building the graph) in this extremely common case. I'm surprised we don't - we have integrated tests that use VTK meshes.
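For illustration, the kind of guard I mean; a minimal standalone sketch, not the actual GEOS code, and the function name and error mechanism here are made up:

#include <mpi.h>
#include <stdexcept>

// Hypothetical early check before any cell-graph construction: ParMETIS
// requires at least two processes, so bail out immediately on a single rank
// instead of building the graph and failing later.
void checkParallelPartitioningPossible( MPI_Comm const comm )
{
  int numRanks = 0;
  MPI_Comm_size( comm, &numRanks );
  if( numRanks < 2 )
  {
    throw std::runtime_error( "ParMETIS-based partitioning requires at least 2 MPI ranks" );
  }
}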

EDIT: ParMETIS wrapper routine does have this check, so something else crashes. Do you have a stacktrace?

castelletto1 commented 11 months ago

EDIT: ParMETIS wrapper routine does have this check, so something else crashes. Do you have a stacktrace?

Adding Solver of type SinglePhaseFVM, named SinglePhaseFlow
Adding Mesh: VTKMesh, mesh
Adding Event: PeriodicEvent, outputs
Adding Event: PeriodicEvent, solverApplications
Adding Output: VTK, vtkOutput
Adding Object CellElementRegion named Domain from ObjectManager::Catalog.
VTKMesh 'mesh': reading mesh from /usr/WS1/castel/geos_develop/flow/reproducer/mesh.vtk
Generating global Ids from VTK mesh
Received signal 11: Segmentation fault

** StackTrace of 12 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: geos::vtk::redistribute(vtkPartitionedDataSet&, int) 
Frame 2: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, int, int) 
Frame 3: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, int, geos::vtk::PartitionMethod, int, int) 
Frame 4: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 5: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 6: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 7: geos::ProblemManager::generateMesh() 
Frame 8: geos::ProblemManager::problemSetup() 
Frame 9: geos::GeosxState::initializeDataRepository() 
Frame 10: main 
Frame 11: __libc_start_main 
Frame 12: _start 
paveltomin commented 11 months ago

Same thing on my side

VTKMesh 'mesh': reading mesh from /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel/mesh/egg_withBurdens_small.vts
Generating global Ids from VTK mesh
Received signal 11: Segmentation fault

** StackTrace of 12 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: geos::vtk::redistribute(vtkPartitionedDataSet&, ompi_communicator_t*) 
Frame 2: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, ompi_communicator_t*, int) 
Frame 3: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, ompi_communicator_t*, geos::vtk::PartitionMethod, int, int) 
Frame 4: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 5: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 6: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 7: geos::ProblemManager::generateMesh() 
Frame 8: geos::ProblemManager::problemSetup() 
Frame 9: geos::GeosxState::initializeDataRepository() 
paveltomin commented 11 months ago

@klevzoff @TotoGaz any simple fix?

klevzoff commented 11 months ago

@klevzoff @TotoGaz any simple fix?

Early return in redistributeMeshes when numRanks == 1 should be sufficient I think?
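Roughly something like this (a minimal sketch only; the names and the template parameter are illustrative and not the real redistributeMeshes signature, which takes the VTK datasets and communicator shown in the stack traces above):

#include <mpi.h>

// Sketch of the proposed early return: on a single rank there is nothing to
// redistribute, so return the already-loaded meshes untouched and skip the
// cell-graph construction and the ParMETIS call entirely.
template< typename MESHES >
MESHES redistributeMeshesSketch( MESHES meshes, MPI_Comm const comm )
{
  int numRanks = 0;
  MPI_Comm_size( comm, &numRanks );
  if( numRanks == 1 )
  {
    return meshes;
  }

  // ... existing cell-graph / ParMETIS redistribution path ...
  return meshes;
}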

paveltomin commented 11 months ago

Thanks @klevzoff, I tried it for my case and then realized that it also crashes for a multi-rank run. The case is from https://github.com/GEOS-DEV/GEOS/tree/swaziri/nAcceleration/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel and it stopped working after https://github.com/GEOS-DEV/GEOS/pull/2580 (FYI @TotoGaz)

castelletto1 commented 11 months ago

Thanks @klevzoff, I tried it for my case and then realized that it also crashes for a multi-rank run. The case is from https://github.com/GEOS-DEV/GEOS/tree/swaziri/nAcceleration/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel and it stopped working after #2580 (FYI @TotoGaz)

I confirm it does crash even for parallel runs. The incompressible single-phase example posted above exhibits the same behavior in serial and in parallel.

TotoGaz commented 11 months ago

Thanks @klevzoff, I tried it for my case and then realized that it also crashes for a multi-rank run. The case is from https://github.com/GEOS-DEV/GEOS/tree/swaziri/nAcceleration/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel and it stopped working after #2580 (FYI @TotoGaz)

Damned, that sucks, I'll have a look at it.

paveltomin commented 11 months ago

Similar thing happens for singlePhaseFlowFractures/fractureFlow_conforming_2d_vtk_input.xml

paveltomin commented 8 months ago

ping

castelletto1 commented 7 months ago

Parallel import has been improved in #3020.

@paveltomin we should check this again

paveltomin commented 7 months ago

Similar thing happens for singlePhaseFlowFractures/fractureFlow_conforming_2d_vtk_input.xml

This still crashes (serial run), but it seems to be a new error:

VTKMesh 'mesh1': reading mesh from /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/inputFiles/singlePhaseFlowFractures/tShapedFracturedCube.vtm
Using global Ids defined in VTK mesh
Received signal 11: Segmentation fault

** StackTrace of 14 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/build-CPU-OPTO2-Hypre-GCC_10.2.0-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so 
Frame 2: GOMP_parallel 
Frame 3: LvArray::ArrayOfArrays<long, long, LvArray::ChaiBuffer> geos::vtk::buildElemToNodesImpl<long, RAJA::PolicyBaseT<(RAJA::Policy)3, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(geos::vtk::AllMeshes&, vtkSmartPointer<vtkCellArray> const&) 
Frame 4: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, ompi_communicator_t*, int) 
Frame 5: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, ompi_communicator_t*, geos::vtk::PartitionMethod, int, int) 
Frame 6: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 7: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 8: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 9: geos::ProblemManager::generateMesh() 
Frame 10: geos::ProblemManager::problemSetup() 
Frame 11: geos::GeosxState::initializeDataRepository() 
Frame 12: main 

And for a parallel run:

VTKMesh 'mesh1': reading mesh from /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/inputFiles/singlePhaseFlowFractures/tShapedFracturedCube.vtm
Using global Ids defined in VTK mesh
Received signal 11: Segmentation fault

** StackTrace of 14 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/build-CPU-OPTO2-Hypre-GCC_10.2.0-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so 
Frame 2: GOMP_parallel 
Frame 3: LvArray::ArrayOfArrays<long, long, LvArray::ChaiBuffer> geos::vtk::buildElemToNodesImpl<long, RAJA::PolicyBaseT<(RAJA::Policy)3, (RAJA::Pattern)1, (RAJA::Launch)0, (camp::resources::v1::Platform)1, RAJA::policy::omp::Parallel, RAJA::wrapper<RAJA::policy::omp::omp_for_schedule_exec<RAJA::policy::omp::Auto> > > >(geos::vtk::AllMeshes&, vtkSmartPointer<vtkCellArray> const&) 
Frame 4: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, ompi_communicator_t*, int) 
Frame 5: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, ompi_communicator_t*, geos::vtk::PartitionMethod, int, int) 
Frame 6: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 7: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 8: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 9: geos::ProblemManager::generateMesh() 
Frame 10: geos::ProblemManager::problemSetup() 
Frame 11: geos::GeosxState::initializeDataRepository() 
Frame 12: main 
paveltomin commented 7 months ago

Thanks @klevzoff, I tried it for my case and then realized that it also crashes for a multi-rank run. The case is from https://github.com/GEOS-DEV/GEOS/tree/swaziri/nAcceleration/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel and it stopped working after #2580 (FYI @TotoGaz)

That case also still crashes (serial run), same error:

VTKMesh 'mesh': reading mesh from /data/rpo_ptls/GEOSX/residual_flash/GEOS/GEOS/inputFiles/poromechanics/nonlinearAcceleration/smallEggModel/egg_withBurdens_small.vts
Generating global Ids from VTK mesh
Received signal 11: Segmentation fault

** StackTrace of 12 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: geos::vtk::redistribute(vtkPartitionedDataSet&, ompi_communicator_t*) 
Frame 2: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, ompi_communicator_t*, int) 
Frame 3: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, ompi_communicator_t*, geos::vtk::PartitionMethod, int, int) 
Frame 4: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 5: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 6: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 7: geos::ProblemManager::generateMesh() 
Frame 8: geos::ProblemManager::problemSetup() 
Frame 9: geos::GeosxState::initializeDataRepository() 
Frame 10: main 
paveltomin commented 3 weeks ago

Getting this error randomly: one run fails, another passes. (screenshot of the error attached)

untereiner commented 3 weeks ago

For each run: same parameter set? Same number of ranks? Regular grid?

paveltomin commented 3 weeks ago

For each run: same parameter set? Same number of ranks? Regular grid?

Same parameter set, same grid, same number of ranks. The grid is a hex grid, not trivial but also not that crazy; a normal reservoir simulation grid.

paveltomin commented 3 weeks ago

Getting this error randomly: one run fails, another passes. (screenshot of the error attached)

longer version:

VTKMesh 'mesh': reading mesh from /data/******.vtu
  reading the dataset...
  redistributing mesh...
Generating global Ids from VTK mesh
[1728484058.800276] [ccnpuscm00001p:5601 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x413e8b0, length=69888, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.800334] [ccnpuscm00001p:5601 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x413e8b0 (host) length 69888 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.800380] [ccnpuscm00001p:5601 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 159; team_id 32771; errmsg Input/output error
[1728484058.800872] [ccnpuscm00001p:5602 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x46825b0, length=59904, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.800929] [ccnpuscm00001p:5602 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x46825b0 (host) length 59904 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.800971] [ccnpuscm00001p:5602 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 161; team_id 32771; errmsg Input/output error
[1728484058.801527] [ccnpuscm00001p:5611 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x42cff10, length=53824, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.801572] [ccnpuscm00001p:5611 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x42cff10 (host) length 53824 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.801606] [ccnpuscm00001p:5611 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 171; team_id 32771; errmsg Input/output error
[1728484058.803588] [ccnpuscm00001p:5585 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x4ded410, length=51200, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.803623] [ccnpuscm00001p:5585 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x4ded410 (host) length 51200 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.803644] [ccnpuscm00001p:5585 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 146; team_id 32771; errmsg Input/output error
[1728484058.803845] [ccnpuscm00001p:5606 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x3288830, length=58752, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.803908] [ccnpuscm00001p:5606 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x3288830 (host) length 58752 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.803953] [ccnpuscm00001p:5606 :0]     tl_ucp_coll.c:142  TL_UCP ERROR failure in recv completion Input/output error
[1728484058.804979] [ccnpuscm00001p:5592 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x4bc9500, length=62720, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.805036] [ccnpuscm00001p:5592 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x4bc9500 (host) length 62720 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.805073] [ccnpuscm00001p:5592 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 158; team_id 32771; errmsg Input/output error
[1728484058.807817] [ccnpuscm00001p:5563 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x49d2d40, length=56576, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.807872] [ccnpuscm00001p:5563 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x49d2d40 (host) length 56576 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.807912] [ccnpuscm00001p:5563 :0] tl_ucp_sendrecv.h:171  TL_UCP ERROR tag 8; dest 121; team_id 32771; errmsg Input/output error
[1728484058.808464] [ccnpuscm00001p:5600 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x3cb7ce0, length=48512, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.808499] [ccnpuscm00001p:5600 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x3cb7ce0 (host) length 48512 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.808525] [ccnpuscm00001p:5600 :0]     tl_ucp_coll.c:142  TL_UCP ERROR failure in recv completion Input/output error
[1728484058.809002] [ccnpuscm00001p:5579 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x1549935d8ce0, length=1728, access=0xf) failed: Resource temporarily unavailable
[1728484058.809806] [ccnpuscm00001p:5571 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x4c43530, length=1856, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.809841] [ccnpuscm00001p:5571 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x4c43530 (host) length 1856 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.809863] [ccnpuscm00001p:5571 :0] tl_ucp_sendrecv.h:108  TL_UCP ERROR tag 8; dest 226; team_id 32771; errmsg Input/output error
[1728484058.813577] [ccnpuscm00001p:5582 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x148f95ee1aa0, length=30208, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.813612] [ccnpuscm00001p:5582 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x148f95ee1aa0 (host) length 30208 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.813634] [ccnpuscm00001p:5582 :0] tl_ucp_sendrecv.h:108  TL_UCP ERROR tag 8; dest 232; team_id 32771; errmsg Input/output error
[1728484058.813722] [ccnpuscm00001p:5579 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x1549935d8ce0 (host) length 1728 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.813748] [ccnpuscm00001p:5579 :0] tl_ucp_sendrecv.h:108  TL_UCP ERROR tag 8; dest 233; team_id 32771; errmsg Input/output error
[1728484058.814055] [ccnpuscm00001p:5572 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x4229ec0, length=55296, access=0x10000f) failed: Resource temporarily unavailable
[1728484058.814110] [ccnpuscm00001p:5572 :0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x4229ec0 (host) length 55296 on md[4]=mlx5_ib0: Input/output error (md supports: host)
[1728484058.814156] [ccnpuscm00001p:5572 :0]     tl_ucp_coll.c:142  TL_UCP ERROR failure in recv completion Input/output error
[1728484058.814528] [ccnpuscm00001p:5532 :0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x57e2d90, length=72960, access=0x10000f) failed: Resource temporarily unavailable
Received signal 11: Segmentation fault

** StackTrace of 22 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: ucp_request_cancel 
Frame 2: /chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ucc/lib/ucc/libucc_tl_ucp.so 
Frame 3: /chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ucc/lib/ucc/libucc_tl_ucp.so 
Frame 4: /chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ucc/lib/libucc.so.1 
Frame 5: ucc_context_progress 
Frame 6: mca_coll_ucc_alltoallv 
Frame 7: PMPI_Alltoallv 
Frame 8: wrap_MPI_Alltoallv 
Frame 9: libparmetis__gkMPI_Alltoallv 
Frame 10: ParMETIS_V3_Mesh2Dual 
Frame 11: geos::parmetis::meshToDual(LvArray::ArrayOfArraysView<long const, long const, true, LvArray::ChaiBuffer> const&, LvArray::ArrayView<long const, 1, 0, int, LvArray::ChaiBuffer> const&, ompi_communicator_t*, int) 
Frame 12: geos::vtk::redistributeByCellGraph(geos::vtk::AllMeshes&, geos::vtk::PartitionMethod, ompi_communicator_t*, int) 
Frame 13: geos::vtk::redistributeMeshes(int, vtkSmartPointer<vtkDataSet>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, vtkSmartPointer<vtkDataSet>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, vtkSmartPointer<vtkDataSet> > > >&, ompi_communicator_t*, geos::vtk::PartitionMethod, int, int) 
Frame 14: geos::VTKMeshGenerator::fillCellBlockManager(geos::CellBlockManager&, geos::SpatialPartition&) 
Frame 15: geos::MeshGeneratorBase::generateMesh(geos::dataRepository::Group&, geos::SpatialPartition&) 
Frame 16: geos::MeshManager::generateMeshes(geos::DomainPartition&) 
Frame 17: geos::ProblemManager::generateMesh() 
Frame 18: geos::ProblemManager::problemSetup() 
Frame 19: geos::GeosxState::initializeDataRepository() 
Frame 20: main 
Frame 21: __libc_start_main 
Frame 22: _start 
=====
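For context on where this sits: per the trace, geos::parmetis::meshToDual hands the distributed element-to-node map to ParMETIS_V3_Mesh2Dual, whose internal MPI_Alltoallv is what the UCX registration failures above interrupt. A standalone sketch of that kind of call (not the actual GEOS wrapper; the ncommon value and argument handling are illustrative):

#include <mpi.h>
#include <parmetis.h>
#include <vector>

// Build the dual (cell-to-cell) graph of a distributed mesh; the collective
// exchange inside ParMETIS_V3_Mesh2Dual is the Alltoallv seen in the trace.
void meshToDualSketch( std::vector< idx_t > & elmdist, // element ranges owned by each rank
                       std::vector< idx_t > & eptr,    // CSR offsets of the local element-to-node map
                       std::vector< idx_t > & eind,    // CSR node ids
                       MPI_Comm comm )
{
  idx_t numflag = 0;  // 0-based numbering
  idx_t ncommon = 4;  // nodes two cells must share to count as neighbors (4 for hex faces; illustrative)
  idx_t * xadj = nullptr;
  idx_t * adjncy = nullptr;

  ParMETIS_V3_Mesh2Dual( elmdist.data(), eptr.data(), eind.data(),
                         &numflag, &ncommon, &xadj, &adjncy, &comm );

  // ... the dual graph (xadj/adjncy) would then be fed to the partitioner ...

  METIS_Free( xadj );
  METIS_Free( adjncy );
}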
untereiner commented 3 weeks ago

CPU or GPU?

paveltomin commented 3 weeks ago

CPU or GPU?

CPU

untereiner commented 3 weeks ago

Did something change in your infrastructure? UCX version? OpenMPI version?

drmichaeltcvx commented 3 weeks ago

As you can see here, we used HPC-X v2.20, which brings in:

HPC-X v2.20
clusterkit-3312df7  1.14.462 (3312df7)
hcoll-6f14f25  4.8.3228 (6f14f25)
nccl_rdma_sharp_plugin-b246b19  2.7 (b246b19)
ompi-41ba5192d22a44a2f6beb3a176bc6cc59a896511  gitclone (41ba519)
sharp-aaa5caab26e3e785f65f88d514807ed51ff24b7d  3.8.0 (aaa5caa)
ucc-a0c139fe1e91b28681018a196e53510044322530  1.4.0 (a0c139f)
ucx-39c8f9b  1.17.0 (39c8f9b)
Linux: redhat8
OFED: mlnx_ofed
Build #: 804
gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3)
CUDA: V12.5.82

We built this GEOS using GCC 13.2

The run took place on 2 AMD nodes, each with 120 Zen3 cores (Azure HBv3 SKU).

drmichaeltcvx commented 3 weeks ago

Sometimes I also experience UCX crashing when we use as many MPI ranks as physical cores.

drmichaeltcvx commented 3 weeks ago

Has anyone else noticed any UCX failures? This is the most recent public HPC-X.