Closed kostrzewa closed 2 years ago
@Marcogarofalo I was able to confirm your observation. At least on my test machine the issue is independent of QUDA_ENABLE_P2P, QUDA_ENABLE_DEVICE_MEMORY_POOL or QUDA_ENABLE_PINNED_MEMORY_POOL. Since I use a different MPI version than you did on Marconi 100, I think we can safely exclude that as a cause as well.
A workaround (which makes things a bit slower, however) is to use full double precision multi-shift CG for all heavy monomials:
BeginMonomial NDCLOVERRAT
[...]
useexternalinverter = quda
usesloppyprecision = double ### <- double precision only
solver = cgmmsnd
EndMonomial
Might be related to: https://github.com/etmc/tmLQCD/issues/501
We should check if any new parameters were added to QudaInvertParam in the process of the ColorSpinorField unification.
@sunpho84 fyi since you were in the mail thread as well
yeah yeah I'm following. So if I understand the issue is not limited to Marconi, and is not related to new version of Quda, and is not related to MPI version
I think it is related to QUDA. It should now be possible to git bisect QUDA (until about the point where feature/ndeg-twisted-clover was merged in) and figure out what causes this regression.
Oh! I see, thanks, now I understand the logic of the bisect command! I had never looked into it. So one should take:
https://github.com/qcdcode/quda/commit/460cce9e168472e61535b643f868948d7eff09f1
as good, and HEAD of develop as bad, right?
Good news, I was able to reproduce this with QUDA's own tests:
QUDA_RESOURCE_PATH=$(pwd) \
./invert_test \
--prec double \
--prec-refine single \
--prec-sloppy single \
--multishift 3
will fail with the same problem.
On the latest commit of develop of https://github.com/lattice/quda:
$ git bisect start
$ git bisect bad
$ git bisect good 1c46e8e945e6c99619b03e728dfc01b5c3f93029
and then the compile / test / bisect / compile / test cycle follows
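The compile / test cycle can in principle be handed over to `git bisect run`, which marks each commit according to the exit status of a script: 0 means good, 125 means "cannot be tested, skip", and any other status from 1 to 127 means bad. The following is only a sketch under stated assumptions: `BUILD_CMD` is a placeholder for however QUDA is built on your machine, `bisect_step` is a hypothetical helper name, and it is assumed that `invert_test` exits with a non-zero status when the failure occurs and that `QUDA_RESOURCE_PATH` is already set in the environment.

```shell
#!/bin/sh
# Sketch of a helper for `git bisect run` (not taken from the thread).
# BUILD_CMD and TEST_CMD are placeholders; override them in the environment.
: "${BUILD_CMD:=cmake --build .}"   # assumption: run from a configured build directory
: "${TEST_CMD:=./invert_test --prec double --prec-refine single --prec-sloppy single --multishift 3}"

# Returns the status that `git bisect run` expects:
#   0   -> this commit is good
#   1   -> this commit is bad (test failed or crashed)
#   125 -> this commit cannot be tested (build failed), skip it
bisect_step() {
    sh -c "$BUILD_CMD" || return 125
    # Map every test failure to 1: statuses above 127 would abort the bisect.
    sh -c "$TEST_CMD" || return 1
    return 0
}
```

Saved as a script (say `bisect_step.sh`, again a hypothetical name) with a final line `bisect_step; exit $?`, this could be driven by `git bisect run ./bisect_step.sh` after the `git bisect start` / `bad` / `good` commands above. If the test fails without a non-zero exit status, the test command would need to be wrapped so that it checks the output instead.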
ps: I think I'm going to remove the qcdcode fork of QUDA, it leads to too much confusion. We can all get push access to the QUDA repo anyway (since all important branches are locked and can only be modified via pull-request).
I am not sure how to do a pull request without it; for pull request #1234 I had to push from qcdcode, as I did not find another way.
If you can't push to lattice/quda, we just need to ask Kate to give you access. Of course, you can also always fork the repo into your own fork and PR from there.
The issue has been fixed in QUDA in 5431b168b09343503d0d676425069dc895879c92 of the develop branch and I can confirm that this also fixes our use case here. In addition, it might have resolved #501 too, so it would be worth trying to see if double-half refinement is now possible.
It seems that #501 was indeed resolved. I've thus submitted #521 for testing which re-exposes the refinement precision as a configurable parameter.
For the following monomial (as an example)
the QUDA interface launches QUDA's multi-shift solver in "refinement" mode (since sloppy precision is set to single). It proceeds in two steps:
While this has worked in the past, it seems that the latest commits of the develop branch have broken this behaviour.
We have independent reproductions of the issue on Marconi 100 by @Marcogarofalo as well as on my development system.