etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. Its core is an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Segmentation fault in multi-shift CG with refinement in the HMC #520

Closed kostrzewa closed 2 years ago

kostrzewa commented 2 years ago

For the following monomial (as an example)

BeginMonomial NDCLOVERRAT
  Timescale = 1 
  kappa = 0.1400645
  CSW = 1.74
  AcceptancePrecision =  1e-21
  ForcePrecision = 1e-16
  StildeMin = 0.0000376
  StildeMax = 4.7 
  MaxSolverIterations = 500 
  Name = ndcloverrat_0_3
  DegreeOfRational = 10
  Cmin = 0 
  Cmax = 3 
  ComputeEVFreq = 0 
  2Kappamubar = 0.0394421632
  2Kappaepsbar = 0.0426076209
  AddTrLog = yes 
  useexternalinverter = quda
  usesloppyprecision = single
  solver = cgmmsnd
EndMonomial

the QUDA interface launches QUDA's multi-shift solver in "refinement" mode (since sloppy precision is set to single). It proceeds in two steps:

  1. multi-shift CG is run up to single precision
  2. unconverged shifts are refined using double-single mixed-precision CG
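The two-step scheme can be sketched in plain Python with NumPy stand-ins for the lattice operators. This is an illustrative sketch, not tmLQCD or QUDA API: the function names, shift values and tolerances are invented for the example, and for simplicity the shifted systems are solved independently rather than with the true multi-shift recurrence. Step 1 solves each shifted system in single precision; step 2 refines any shift that misses the target residual with a double-precision CG started from the sloppy solution.

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-10, maxiter=500, dtype=np.float64):
    """Plain CG for A x = b, with A symmetric positive definite."""
    A = A.astype(dtype)
    b = b.astype(dtype)
    x = np.zeros_like(b) if x0 is None else x0.astype(dtype)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        # stop once the (recursive) residual reaches the relative tolerance
        if np.sqrt(rr) <= tol * bnorm:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def multishift_with_refinement(A, b, shifts, tol_sloppy=1e-5, tol=1e-12):
    """Step 1: sloppy (single-precision) solve of (A + sigma) x = b per shift.
    Step 2: double-precision refinement of shifts that did not reach tol,
    reusing the sloppy solution as the initial guess."""
    n = len(b)
    solutions = []
    for sigma in shifts:
        A_s = A + sigma * np.eye(n)
        # step 1: solve up to single precision (the "sloppy" pass)
        x = cg(A_s, b, tol=tol_sloppy, dtype=np.float32).astype(np.float64)
        rel_res = np.linalg.norm(b - A_s @ x) / np.linalg.norm(b)
        if rel_res > tol:
            # step 2: refine the unconverged shift in double precision
            x = cg(A_s, b, x0=x, tol=tol)
        solutions.append(x)
    return solutions
```

In the log below, the segfault occurs precisely at the transition into step 2, when the refinement CG is invoked for the first unconverged shift (shift 0).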

While this has worked in the past, it seems that the latest commits of the develop branch have broken this behaviour.

# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Time for reorder_gauge_toQuda 2.820775e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat_0_3:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/reorder_gauge_toQuda
# TM_QUDA: Time for loadGaugeQuda 2.889670e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat_0_3:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/loadGaugeQuda
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 3.600394e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat_0_3:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/loadCloverQuda
# TM_QUDA: mu = 0.140800000000, epsilon = 0.152100000000 kappa = 0.140064500000, csw = 1.740000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.010873e-02 s level: 5 proc_id: 0 /HMC/ndcloverrat_0_3:ndrat_heatbath/solve_mms_nd_plus/solve_mms_nd/invert_eo_quda_twoflavour_mshift/reorder_spinor_eo_toQuda
# QUDA: MultiShift CG: Converged after 23 iterations
# QUDA:  shift=0, 23 iterations, relative residual: iterated = 1.963668e-05
# QUDA:  shift=1, 23 iterations, relative residual: iterated = 9.198576e-10
# QUDA:  shift=2, 11 iterations, relative residual: iterated = 1.941555e-09
# QUDA:  shift=3, 5 iterations, relative residual: iterated = 1.438791e-09
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
[cassiopeia:3629647] *** Process received signal ***
[cassiopeia:3629647] Signal: Segmentation fault (11)
[cassiopeia:3629647] Signal code: Address not mapped (1)
[cassiopeia:3629647] Failing at address: (nil)
[cassiopeia:3629646] *** Process received signal ***
[cassiopeia:3629646] Signal: Segmentation fault (11)
[cassiopeia:3629646] Signal code: Address not mapped (1)
[cassiopeia:3629646] Failing at address: (nil)
[cassiopeia:3629646] [cassiopeia:3629647] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x15420)[0x7e15f2ba3420]
[cassiopeia:3629647] [ 1] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x15420)[0x71b0b6086420]
[cassiopeia:3629646] [ 1] /home/bartek/build/quda-develop/install_dir/lib/libquda.so(_ZN4quda2CGclERNS_16ColorSpinorFieldES2_PS1_d+0x701)[0x7e15fac9a181]
/home/bartek/build/quda-develop/install_dir/lib/libquda.so(_ZN4quda2CGclERNS_16ColorSpinorFieldES2_PS1_d+0x701)[0x71b0be17d181]
[cassiopeia:3629646] [ 2] [cassiopeia:3629647] [ 2] /home/bartek/build/quda-develop/install_dir/lib/libquda.so(invertMultiShiftQuda+0x20b4)[0x71b0be223744]
[cassiopeia:3629646] [ 3] ../../hmc_tm(+0x44973)[0x5a39e6d42973]
[cassiopeia:3629646] /home/bartek/build/quda-develop/install_dir/lib/libquda.so(invertMultiShiftQuda+0x20b4)[0x7e15fad40744]
[cassiopeia:3629647] [ 3] ../../hmc_tm(+0x44973)[0x640704624973]
[cassiopeia:3629647] [ 4] ../../hmc_tm(+0x1d2526)[0x6407047b2526]
[cassiopeia:3629647] [ 5] ../../hmc_tm(+0x1d298f)[0x6407047b298f]
[cassiopeia:3629647] [ 6] ../../hmc_tm(+0x76e9f)[0x640704656e9f]
[cassiopeia:3629647] [ 7] ../../hmc_tm(+0x3160f)[0x64070461160f]
[cassiopeia:3629647] [ 8] ../../hmc_tm(+0x6749)[0x6407045e6749]
[cassiopeia:3629647] [ 9] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7e15f27320b3]
[cassiopeia:3629647] [10] ../../hmc_tm(+0x526e)[0x6407045e526e]
[cassiopeia:3629647] *** End of error message ***
../../hmc_tm(+0x1d2526)[0x5a39e6ed0526]
[cassiopeia:3629646] [ 5] ../../hmc_tm(+0x1d298f)[0x5a39e6ed098f]
[cassiopeia:3629646] [ 6] ../../hmc_tm(+0x76e9f)[0x5a39e6d74e9f]
[cassiopeia:3629646] [ 7] ../../hmc_tm(+0x3160f)[0x5a39e6d2f60f]
[cassiopeia:3629646] [ 8] ../../hmc_tm(+0x6749)[0x5a39e6d04749]
[cassiopeia:3629646] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x71b0b5c150b3]
[cassiopeia:3629646] [10] ../../hmc_tm(+0x526e)[0x5a39e6d0326e]
[cassiopeia:3629646] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node cassiopeia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

We have independent reproductions of the issue on Marconi 100 by @Marcogarofalo as well as on my development system.

kostrzewa commented 2 years ago

@Marcogarofalo I was able to confirm your observation. At least on my test machine the issue is independent of QUDA_ENABLE_P2P, QUDA_ENABLE_DEVICE_MEMORY_POOL and QUDA_ENABLE_PINNED_MEMORY_POOL. Since I used a different MPI version than you did on Marconi 100, I think we can safely exclude that as a cause as well.

kostrzewa commented 2 years ago

A workaround (which, however, makes things a bit slower) is to use full double-precision multi-shift CG for all heavy monomials:

BeginMonomial NDCLOVERRAT
  [...] 
  useexternalinverter = quda
  usesloppyprecision = double ### <- double precision only
  solver = cgmmsnd
EndMonomial

kostrzewa commented 2 years ago

Might be related to: https://github.com/etmc/tmLQCD/issues/501

kostrzewa commented 2 years ago

We should check if any new parameters were added to QudaInvertParam in the process of the ColorSpinorField unification.

kostrzewa commented 2 years ago

@sunpho84 fyi since you were in the mail thread as well

sunpho84 commented 2 years ago

Yeah yeah, I'm following. So, if I understand correctly, the issue is not limited to Marconi, is not related to the new version of QUDA, and is not related to the MPI version?

kostrzewa commented 2 years ago

I think it is related to QUDA. It should now be possible to git bisect QUDA (back to roughly the point where feature/ndeg-twisted-clover was merged in) and figure out what causes this regression.

sunpho84 commented 2 years ago

Oh! I see, thanks, now I understand the logic of the bisect command! I had never looked into it. So one should take: https://github.com/qcdcode/quda/commit/460cce9e168472e61535b643f868948d7eff09f1 as good, and HEAD of develop as bad, right?

kostrzewa commented 2 years ago

Good news, I was able to reproduce this with QUDA's own tests:

QUDA_RESOURCE_PATH=$(pwd) \
./invert_test \
  --prec double \
  --prec-refine single \
  --prec-sloppy single \
  --multishift 3

will fail with the same problem.

kostrzewa commented 2 years ago

See https://github.com/lattice/quda/issues/1244

kostrzewa commented 2 years ago

> Oh! I see, thanks, now I understand the logic of the bisect command! I had never looked into it. So one should take: https://github.com/qcdcode/quda/commit/460cce9e168472e61535b643f868948d7eff09f1 as good, and HEAD of develop as bad, right?

On the latest commit of develop of https://github.com/lattice/quda:

$ git bisect start
$ git bisect bad
$ git bisect good 1c46e8e945e6c99619b03e728dfc01b5c3f93029

and then the compile / test / bisect / compile / test cycle follows
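The cycle can be automated with `git bisect run`, which takes a command whose exit status classifies each commit: 0 marks it good, 125 tells bisect to skip it (e.g. when the commit does not build), and any other status up to 127 marks it bad. Below is a hypothetical driver sketch; the build command, paths and the `RUN_BISECT_STEP` guard are assumptions that must be adapted to the local QUDA build setup.

```python
#!/usr/bin/env python3
"""Hypothetical driver for: RUN_BISECT_STEP=1 git bisect run ./bisect_quda.py
The build and test commands below are assumptions and must be adapted
to the local QUDA build directory layout."""
import os
import subprocess
import sys

def bisect_exit_code(build_ok, test_ok):
    # git bisect run convention: 0 = good, 125 = skip, 1..127 (not 125) = bad
    if not build_ok:
        return 125  # commit cannot be judged, skip it
    return 0 if test_ok else 1

def run(cmd):
    """Run a shell command; True if it exited with status 0."""
    return subprocess.run(cmd, shell=True).returncode == 0

if __name__ == "__main__" and os.environ.get("RUN_BISECT_STEP"):
    build_ok = run("cmake --build build -j")
    # the reproducer from above: a segfault yields a non-zero exit status
    test_ok = build_ok and run(
        "cd build/tests && QUDA_RESOURCE_PATH=$(pwd) ./invert_test"
        " --prec double --prec-refine single --prec-sloppy single --multishift 3"
    )
    sys.exit(bisect_exit_code(build_ok, test_ok))
```

The environment-variable guard keeps the file side-effect free when imported; drop it if you prefer a plain script.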

kostrzewa commented 2 years ago

ps: I think I'm going to remove the qcdcode fork of QUDA, it leads to too much confusion. We can all get push access to the QUDA repo anyway (since all important branches are locked and can only be modified via pull-request).

Marcogarofalo commented 2 years ago

I am not sure how to do a pull request otherwise; for pull request #1234 I had to push from qcdcode, I did not find another way.

kostrzewa commented 2 years ago

> I am not sure how to do a pull request otherwise; for pull request #1234 I had to push from qcdcode, I did not find another way.

If you can't push to lattice/quda we just need to ask Kate to give you access. Of course, you can also always fork the repo into your own fork and PR from there.

kostrzewa commented 2 years ago

The issue has been fixed in QUDA in 5431b168b09343503d0d676425069dc895879c92 on the develop branch and I can confirm that this also fixes our use case here. In addition, it might have resolved #501 too, so it would be worth trying to see if double-half refinement is now possible.

kostrzewa commented 2 years ago

It seems that #501 was indeed resolved. I've thus submitted #521 for testing, which re-exposes the refinement precision as a configurable parameter.