etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions and an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Split gauge update #502

Closed urbach closed 2 years ago

urbach commented 2 years ago

some steps towards gauge update on GPUs

urbach commented 2 years ago

@kostrzewa commented on this pull request.

@@ -89,6 +89,13 @@ typedef enum ExternalInverter_s { QPHIX_INVERTER } ExternalInverter;

+/* enumeration type for the external inverter */

@urbach does it perhaps make sense to merge this and ExternalInverter ?

thought about it, but then decided against it.

ExternalLibrary is going to put the whole GaugeUpdate on the GPU, I hope. In the other case only the inverter is on the GPU. So, in my opinion two very different things. However, I can easily change this, if there are good arguments.
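The distinction drawn above can be sketched in C as two separate enumeration types. This is only an illustration under assumptions: `ExternalInverter`, `QPHIX_INVERTER` and the name `ExternalLibrary` appear in this thread, while the remaining enumerators, the struct `sketch_monomial` and the helper `gauge_update_on_gpu` are hypothetical and not taken from the tmLQCD sources.

```c
#include <stdbool.h>

/* Which external code solves the Dirac equation (only the inverter is
   offloaded). Only QPHIX_INVERTER is visible in the diff above; the other
   enumerators are assumed placeholders. */
typedef enum ExternalInverter_s {
  NO_EXT_INV = 0,
  QUDA_INVERTER,
  QPHIX_INVERTER
} ExternalInverter;

/* Which external library takes over a whole part of the update, e.g. the
   gauge force. All enumerators here are assumptions. */
typedef enum ExternalLibrary_s {
  NO_EXT_LIB = 0,
  QUDA_LIB
} ExternalLibrary;

/* Hypothetical monomial descriptor carrying both settings independently. */
typedef struct {
  ExternalInverter inverter; /* offload of individual solves */
  ExternalLibrary library;   /* offload of the whole gauge update */
} sketch_monomial;

/* True if the gauge update of this monomial should run in the library. */
static bool gauge_update_on_gpu(const sketch_monomial *m) {
  return m->library == QUDA_LIB;
}
```

Keeping the two types apart lets a monomial use an external inverter without implying that its force computation is offloaded as well, which is the argument made against merging them.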

kostrzewa commented 2 years ago

@urbach should we split the work on this somehow? In principle I have some time available in the next two weeks here or there.

urbach commented 2 years ago

sure that would be great. I need to understand first, what the QUDA routine actually does. Maybe @sbacchio can help there?

sbacchio commented 2 years ago

Hi, yes, I can also help. Do you want to have a discussion on how to distribute it, and on what you have in mind for now?

urbach commented 2 years ago

A discussion would be good. Bartek certainly has more insight already; the following steps are needed/unclear

urbach commented 2 years ago

what about a short meeting today at 1pm or 2pm?

sbacchio commented 2 years ago

Both timings would be fine for me

kostrzewa commented 2 years ago

Same here although I would slightly prefer 2pm

urbach commented 2 years ago

then let's say 2pm.

sunpho84 commented 2 years ago

Hi

sorry I'm late, where can I join?

On 03/09/2021 10:01, Carsten Urbach wrote:

Same here although I would slightly prefer 2pm

then let's say 2pm.


kostrzewa commented 2 years ago

Just doing some dummy work to see if I can easily get a working call to the QUDA gauge force set up. Refinements can be made later (such as doing the projection onto the adjoint rep already in QUDA).
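One common reading of "projection onto the adjoint rep" is taking the traceless anti-Hermitian part of a 3x3 complex matrix, whose eight real degrees of freedom correspond to the algebra components. The sketch below shows only this matrix-level step in plain C99. The function name is hypothetical, and the normalisation, sign conventions and storage layout used by tmLQCD and QUDA may well differ.

```c
#include <complex.h>

/* Project a 3x3 complex matrix (row-major) onto its traceless
   anti-Hermitian part:
     P(M) = (M - M^dagger)/2 - tr(M - M^dagger)/6 * 1
   `out` may alias `in`. Hypothetical helper, not a tmLQCD or QUDA routine. */
static void project_traceless_antihermitian(double complex out[9],
                                            const double complex in[9]) {
  double complex a[9];

  /* anti-Hermitian part: a = (M - M^dagger)/2 */
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
      a[3 * i + j] = 0.5 * (in[3 * i + j] - conj(in[3 * j + i]));
    }
  }

  /* remove the trace, spread evenly over the diagonal */
  double complex tr_third = (a[0] + a[4] + a[8]) * (1.0 / 3.0);
  for (int k = 0; k < 9; k++) {
    out[k] = a[k];
  }
  out[0] -= tr_third;
  out[4] -= tr_third;
  out[8] -= tr_third;
}
```

The result satisfies `out[3*i+j] == -conj(out[3*j+i])` and has vanishing trace up to rounding, so applying the function twice leaves the matrix unchanged.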

sbacchio commented 2 years ago

Hi Bartek, just so that we don't do the work twice: we are working on it. We have almost everything figured out, and at some point early next week we are going to push the implementation here. @pittlerf is working on it with me.

kostrzewa commented 2 years ago

Alright, good to know. I'll push my latest additions (which still result in a segfault unfortunately) in a separate branch and you can then force-push here if you'd like.

sbacchio commented 2 years ago

ok great!

kostrzewa commented 2 years ago

Would be good to have some WIP commits to look at if possible.

sbacchio commented 2 years ago

OK, we have some cleanup to do; later today.

kostrzewa commented 2 years ago

no worries, I didn't mean to imply any pressure, I'm just really interested in what you came up with

sbacchio commented 2 years ago

@kostrzewa can we be added to etmc so we can push here directly? At the moment changes are under https://github.com/pittlerf/tmLQCD/tree/testing_gauge_force_ferenc

kostrzewa commented 2 years ago

@pittlerf was already a member and I've just added you, @sbacchio

kostrzewa commented 2 years ago

Thanks @pittlerf and @sbacchio. Needs some minor conflict resolution

sbacchio commented 2 years ago

We checked the current implementation serially and, as far as we can tell, it matches the gauge_derivative implementation. Now we have an issue running in parallel: it dies in the initialization of QUDA. We still have to investigate further.

kostrzewa commented 2 years ago

It (e.g. tmLQCD's hmc_tm or invert or offline_measurement) launches fine in parallel for me, but I have some dH issues when the gauge monomial is on the GPU (which are likely to do with the fact that I forgot to pull in the changes to QUDA that you've made w.r.t. the double alignment). The lib is recompiling as we speak and I'll test again in a few minutes.

kostrzewa commented 2 years ago

I'm afraid that also with the additional commits in feature/ndeg-twisted-clover I see large dH in a pure-gauge run using the QUDA version of the gauge force.

sbacchio commented 2 years ago

Thanks for checking, but yes, it is still WIP. So far we did unit tests with our executable.

pittlerf commented 2 years ago

We checked serially the current implementation and as far as we can tell it matches the gauge_derivative implementation. Now we have some issue running in parallel.. It dies in the initialization of QUDA.. still have to investigate further

This is now solved: we had problems when running interactively; using a script it also works in parallel.

pittlerf commented 2 years ago

I'm afraid that also with the additional commits in feature/ndeg-twisted-clover I see large dH in a pure-gauge run using the QUDA version of the gauge force.

Hi Bartek, could we have your input script? We have implemented some checks and would like to try them.

kostrzewa commented 2 years ago

Sure, I just do something like:

L=16
T=32

nrxprocs = 1
nryprocs = 1
nrzprocs = 1
ompnumthreads = 3

Measurements = 10000
StartCondition = hot

NSave = 500000
ThetaT = 1.0

UseEvenOdd = yes
ReversibilityCheck = no
ReversibilityCheckIntervall = 100
DebugLevel = 2

BeginMonomial GAUGE
  Type = Iwasaki
  beta = 1.9
  Timescale = 0
  UseExternalLibrary = quda
EndMonomial

BeginIntegrator 
  Type0 = 2MN
  IntegrationSteps0 = 100
  Tau = 1.0
  Lambda0 = 0.193
  NumberOfTimescales = 1
EndIntegrator

for pure gauge.

pittlerf commented 2 years ago

Thank you, we also see the problem and are now trying to identify the issue.

kostrzewa commented 2 years ago

For a full nf=2+1+1 twisted clover run (based on cA211.53.24 but on a 16c32 lattice instead):

NrXProcs = 1
NrYProcs = 1
NrZProcs = 1
ompnumthreads = 6

L=16
T=32

Measurements = 1000
# StartCondition = hot
StartCondition = continue
InitialStoreCounter = readin

2KappaMu = 0.0014846837
2KappaMuBar = 0.0394421632
2KappaEpsBar = 0.0426076209
CSW = 1.74
kappa = 0.1400645

NSave = 500000
ThetaT = 1.0
UseEvenOdd = yes
ReversibilityCheck = no
ReversibilityCheckIntervall = 100
DebugLevel = 2

# StrictResidualCheck = yes
UseRelativePrecision = yes

BeginExternalInverter QUDA
  Pipeline = 10
  gcrNkrylov = 20
  MGCoarseMuFactor = 1.0, 1.0, 20.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 24, 24
  MGSetupSolver = cg
  MGSetup2KappaMu = 0.0014846837
  MGVerbosity = silent, silent, silent
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.1, 0.1, 0.1
  MGCoarseMaxSolverIterations = 15, 15, 15
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGBlockSizesX = 4,2
  MGBlockSizesY = 4,2
  MGBlockSizesZ = 4,2
  MGBlockSizesT = 4,2
  MGOverUnderRelaxationFactor = 0.90, 0.90, 0.90

  MGResetSetupMDUThreshold = 1.0
  MGRefreshSetupMDUThreshold = 0.06249
  MGRefreshSetupMaxSolverIterations = 15, 15
EndExternalInverter

BeginMeasurement CORRELATORS
  Frequency = 1
EndMeasurement

BeginMonomial GAUGE
  Type = Iwasaki
  beta = 1.726
  Timescale = 0
  UseExternalLibrary = no
EndMonomial

BeginMonomial CLOVERDET
  Timescale = 1
  kappa = 0.1400645
  2KappaMu = 0.0014846837
  CSW = 1.74
  rho = 0.09353509
  MaxSolverIterations = 1000
  AcceptancePrecision =  1.e-21
  ForcePrecision = 1.e-16
  Name = cloverdetlight
  solver = cg
  useexternalinverter = quda
  usesloppyprecision = half
EndMonomial

BeginMonomial CLOVERDETRATIO
  Timescale = 2
  kappa = 0.1400645
  2KappaMu = 0.0014846837
  rho = 0.01039279
  rho2 = 0.09353509
  CSW = 1.74
  MaxSolverIterations = 500
  AcceptancePrecision =  1.e-21
  ForcePrecision = 1.e-18
  Name = cloverdetratio1light
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial

BeginMonomial CLOVERDETRATIO
  Timescale = 3
  kappa = 0.1400645
  2KappaMu = 0.0014846837
  rho = 0.0
  rho2 = 0.01039279
  CSW = 1.74
  MaxSolverIterations = 500
  AcceptancePrecision =  1.e-21
  ForcePrecision = 1.e-18
  Name = cloverdetratio2light
  solver = mg
  useexternalinverter = quda
  usesloppyprecision = single
EndMonomial

BeginMonomial NDCLOVERRAT
  Timescale = 1
  kappa = 0.1400645
  CSW = 1.74
  AcceptancePrecision =  1e-21
  ForcePrecision = 1e-16
  StildeMin = 0.0000376
  StildeMax = 4.7
  MaxSolverIterations = 500
  Name = ndcloverrat_0_3
  DegreeOfRational = 10
  Cmin = 0
  Cmax = 3
  ComputeEVFreq = 0
  2Kappamubar = 0.0394421632
  2Kappaepsbar = 0.0426076209
  AddTrLog = yes
  useexternalinverter = quda
  usesloppyprecision = single
  solver = cgmmsnd
EndMonomial

BeginMonomial NDCLOVERRAT
  Timescale = 2
  kappa = 0.1400645
  CSW = 1.74
  MaxSolverIterations = 1000
  AcceptancePrecision =  1e-21
  ForcePrecision = 1e-16
  # lambda_min = 8e-6 (min evals go as low as 1.5e-5), maximal evals are found as high as 0.85 and fluctuate strongly
  StildeMin = 0.0000376
  StildeMax = 4.7
  Name = ndcloverrat_4_6
  DegreeOfRational = 10
  Cmin = 4
  Cmax = 6
  ComputeEVFreq = 0
  2Kappamubar = 0.0394421632
  2Kappaepsbar = 0.0426076209
  AddTrLog = no
  useexternalinverter = quda
  usesloppyprecision = single
  solver = cgmmsnd
EndMonomial

BeginMonomial NDCLOVERRAT
  Timescale = 3
  kappa = 0.1400645
  CSW = 1.74
  AcceptancePrecision =  1e-21
  ForcePrecision = 1e-16
  MaxSolverIterations = 5000
  StildeMin = 0.0000376
  StildeMax = 4.7
  Name = ndcloverrat_7_9
  DegreeOfRational = 10
  Cmin = 7
  Cmax = 9
  ComputeEVFreq = 0
  2Kappamubar = 0.0394421632
  2Kappaepsbar = 0.0426076209
  AddTrLog = no
  useexternalinverter = quda
  usesloppyprecision = single
  solver = cgmmsnd
EndMonomial

BeginMonomial NDCLOVERRATCOR
  Timescale = 1
  kappa = 0.1400645
  CSW = 1.74
  AcceptancePrecision =  1e-20
  ForcePrecision = 1e-16
  MaxSolverIterations = 5000
  StildeMin = 0.0000376
  StildeMax = 4.7
  Name = ndcloverratcor
  DegreeOfRational = 10
  ComputeEVFreq = 0
  2Kappamubar = 0.0394421632
  2Kappaepsbar = 0.0426076209
  useexternalinverter = quda
  usesloppyprecision = double
  solver = cgmmsnd
EndMonomial

BeginIntegrator 
  Type0 = 2MNFG
  Type1 = 2MNFG
  Type2 = 2MNFG
  Type3 = 2MNFG
  IntegrationSteps0 = 1
  IntegrationSteps1 = 1
  IntegrationSteps2 = 1
  IntegrationSteps3 = 8
  Tau = 1.0
  Lambda0 = 0.166667
  Lambda1 = 0.166667
  Lambda2 = 0.166667
  Lambda3 = 0.166667
  NumberOfTimescales = 4
EndIntegrator

BeginOperator CLOVER
  kappa = 0.1400645
  2KappaMu = 0.0014846837
  CSW = 1.74
  UseEvenOdd = no
  SolverPrecision = 1e-20
  MaxSolverIterations = 500
  UseExternalInverter = QUDA
  Solver = mg
  usesloppyprecision = single
EndOperator

pittlerf commented 2 years ago

Actually, I found an issue with OpenMP threads; I will upload the fix asap.

pittlerf commented 2 years ago

@kostrzewa I tried to push the fix, but it seems that I do not have the right to push here:

ERROR: Permission to etmc/tmLQCD.git denied to pittlerf.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.

Can you give me write permissions? Thanks

kostrzewa commented 2 years ago

You should be able to push. How have you configured the remote?

kostrzewa commented 2 years ago

Actually you are right, the base permission was "read" for some reason. You should be able to push now.

pittlerf commented 2 years ago

thank you :)

kostrzewa commented 2 years ago

Very nice, the last commit has done the trick. I have some more instrumentation which I'd like to push if you don't mind, and one small performance improvement.

pittlerf commented 2 years ago

of course, go ahead :)

kostrzewa commented 2 years ago

Thanks! It would be great if we could resolve the merge conflict and go over the remaining issues over the next few days.

kostrzewa commented 2 years ago

I think I've addressed the points in 6f579c8 and 7f61ee1.

kostrzewa commented 2 years ago

I realised that the merge conflict was rather major so I went ahead and fixed it, it wasn't particularly clear...

pittlerf commented 2 years ago

We also added an online checking tool, which can be used in debugging. Shall we remove the executable test_gauge_derivative then, or shall we keep it?

kostrzewa commented 2 years ago

Probably better to remove it since it might break and then we need to take care of it.

kostrzewa commented 2 years ago

I've made some small adjustments and think that this can be merged. The gauge derivative regularly produces relative deviations in excess of 1e-10 on my machine, so I've increased the threshold. I don't think that the deviations that I see are worrisome, however, as they occur only when the component in question is of a similar size to the threshold. I've made program termination dependent on g_strict_residual_check, which allows one to study the behaviour of the deviations on all lattice sites over multiple trajectories.
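A component-wise comparison of this kind could look roughly like the following sketch. It is not the code that was merged: the function name `count_deviations`, its signature and the treatment of zero reference values are assumptions, and the `strict` argument merely stands in for the role that g_strict_residual_check plays in the description above.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Compare a reference force (e.g. from the CPU) with a test force (e.g.
   from the GPU) component by component. Returns the number of components
   whose relative deviation exceeds `threshold`. The program is terminated
   only if `strict` is non-zero, so that with strict == 0 the deviations can
   be logged and studied instead. Hypothetical helper, not tmLQCD code. */
static int count_deviations(const double *ref, const double *test, size_t n,
                            double threshold, int strict) {
  int n_bad = 0;
  for (size_t i = 0; i < n; i++) {
    double dev = fabs(ref[i] - test[i]);
    double scale = fabs(ref[i]);
    /* fall back to the absolute deviation if the reference is exactly zero */
    double rel = scale > 0.0 ? dev / scale : dev;
    if (rel > threshold) {
      n_bad++;
      fprintf(stderr, "component %zu: ref=%e test=%e rel. dev.=%e\n",
              i, ref[i], test[i], rel);
    }
  }
  if (n_bad > 0 && strict) {
    exit(EXIT_FAILURE);
  }
  return n_bad;
}
```

With a purely relative criterion, components of very small magnitude are flagged most easily, because rounding differences of fixed absolute size then become large relative deviations. A non-strict mode that counts and logs instead of aborting makes it possible to inspect such cases over many trajectories.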