JeffersonLab / chroma

The Chroma Software System for Lattice QCD
http://jeffersonlab.github.io/chroma

QDP-JIT clover term broken for Nc > 3 builds #67

Closed cpviolator closed 2 years ago

cpviolator commented 2 years ago

I've built some Nc=3,4,5 stacks using a variety of combinations of QDPXX, QDP-JIT, and a color-agnostic branch of QUDA:

CHROMA + QDPXX
CHROMA + QDPJIT
CHROMA + QDPXX + QUDA
CHROMA + QDPJIT + QUDA

where, in the +QUDA cases, QUDA handles only the inversions, as usual.

For Nc=3, all 4 stacks produce 2 Flavour clover HMC configurations (no link smearing) with excellent agreement in Delta H when using the same seed for all 4 stacks. When I move to Nc>3, only the CHROMA + QDPXX and CHROMA + QDPXX + QUDA stacks are stable: their Delta H values agree and the solver converges. For both the CHROMA + QDPJIT and CHROMA + QDPJIT + QUDA stacks, the solver diverges quite quickly. This isolates the problem to Chroma's QDPJIT clover term code. If QUDA's treatment of the QDPJIT ordered clover term were bad, only the CHROMA + QDPJIT + QUDA builds would produce poor data. Furthermore, the fact that both the CHROMA + QDPJIT + QUDA and CHROMA + QDPJIT stacks produce good data for Nc=3 demonstrates that the issue is in the Nc != 3 QDPJIT code only.

I'm using CHROMA devel with master merged in and some minor modifications to place guards around baryon code. QDPXX and QDP-JIT are also on their devel branches with similar harmless guards around baryon code. The QUDA branch is feature/Nc_agnosticism, which is up to date with QUDA's develop branch.

If someone could cast a more experienced eye than mine over the QDPJIT clover code in CHROMA to look for an error I cannot see, I would be very appreciative. I can also provide a stack reproducer, but I do not have push rights to QDPXX or QDPJIT, so I will have to fork them for downloading.

cpviolator commented 2 years ago

To give you some assurance that QUDA's treatment of the QDPJIT ordered clover term is correct, see for example this line https://github.com/lattice/quda/blob/feature/Nc_agnosticism/include/clover_field_order.h#L789

I attach a simple C++ program to demonstrate that the ordering is valid for all Nc.

#include <cstdlib>   // atoi
#include <iostream>

using namespace std;

int main(int argc, char **argv) {

  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " Nc" << endl;
    return 1;
  }

  int Nc = atoi(argv[1]);
  // number of complex off-diagonal elements in a 2Nc x 2Nc Hermitian block
  int elems = 2*Nc*Nc - Nc;

  // build the packed off-diagonal index table
  int idtab[elems];
  int idx = 0;
  const int L = 2*Nc - 1;
  for(int col = 0; col < L; col++) {
    int id = col*(col+3)/2;
    idtab[idx] = id;
    std::cout << idtab[idx] << ", ";
    idx++;
    for(int i = col+1; i < L; i++) {
      id += i;
      idtab[idx] = id;
      std::cout << idtab[idx] << ", ";
      idx++;
    }
  }
  std::cout << endl;

  // check that every index 0..elems-1 appears in idtab
  // (with exactly elems entries this means idtab is a permutation)
  bool idtab_check[elems];
  for(int i = 0; i < elems; i++) idtab_check[i] = false;

  for(int i = 0; i < elems; i++)
    for(int k = 0; k < elems; k++)
      if(idtab[k] == i) idtab_check[i] = true;

  for(int i = 0; i < elems; i++)
    if(idtab_check[i]) std::cout << "true at " << i << endl;
    else {
      std::cout << "false at " << i << endl;
      return 1;
    }

  return 0;
}
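
For reference, compiling and running it for a couple of Nc values looks like this (file name and compiler invocation are just placeholders):

g++ -O2 -o idtab_check idtab_check.cpp
./idtab_check 3
./idtab_check 4

For Nc=3 it prints the familiar 15-element permutation 0, 1, 3, 6, 10, 2, 4, 7, 11, 5, 8, 12, 9, 13, 14, followed by "true at i" for every index.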
fwinter commented 2 years ago

Great analysis! However, I don't follow the line of argument when you say: This isolates the problem to Chroma's QDPJIT clover term code. From what I understand, at this point it could also be QDPJIT itself misbehaving somewhere for Nc != 3, which is reasonable to assume since the Nc != 3 case is rarely tested (never, actually). What I would do is reduce the action to pure Wilson and see how that goes. Does this make sense?

Do you want to share your changes for Nc != 3? If you think your changes to Chroma/devel and qdp-jit/devel are production-ready you can create pull requests.

Alternatively: you can point me to your branches and I'll test the Clover term for Nc != 3.

cpviolator commented 2 years ago

@fwinter Will do RE pure Wilson, good idea. I can certainly go ahead and make PRs for both chroma and qdp(xx/jit). They really are nothing more than compiler guards around Nc=3 code, a couple of dud templates on QDP_NC that throw runtime errors if not instantiated, and compile-time warnings to allow for a clean CMake installation. In chroma the stouting routine must be disabled for Nc != 3, as must the QDP baryon contraction kernels (a minimal sketch of the kind of guard I mean follows).
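
To make "compiler guards around Nc=3 code" concrete, here is a self-contained sketch of the pattern, assuming QDP_NC is the compile-time number of colours; the function name and message are placeholders, not the actual chroma code:

#include <cstdio>
#include <cstdlib>

#ifndef QDP_NC
#define QDP_NC 3   // placeholder default so the sketch compiles on its own
#endif

void baryonContractions()
{
#if QDP_NC == 3
  // the original Nc=3-only kernel would live here
  std::printf("running Nc=3 baryon contractions\n");
#else
  // dud path: compiles for any Nc, but refuses to run
  std::fprintf(stderr, "baryon contractions are only implemented for Nc=3\n");
  std::exit(EXIT_FAILURE);
#endif
}

int main() { baryonContractions(); }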

With those branches in the tree I can share a stack builder for all the combinations (and the XML files) I use to drive them with a fully remote source.

Just did a quick test with pure Wilson and the behaviour is different, but still concerning. QDP-JIT is taking about 5x more iterations than QDPXX to solve the same problem. Common to both Wilson and clover runs, the source norm is different when using QDP-JIT compared to QDPXX, both with and without QUDA enabled. I think you should see this for yourself so I'll get to work on a reproducer ASAP.

fwinter commented 2 years ago

I dirty-hacked qdpxx to build with Nc=4 (not pushed, probably similar to what you did) and tested the Clover term in isolation. I got a corrupted stack back. It seems the routine LDagDLInvSiteLoop might be Nc=3 specific. See the line:

      inv_offd[elem_kj] -= inv_offd[elem_kl] * v[l];

For Nc=4 the value of elem_kj can climb to at least 27, but space is made for only 15 elements. That can't work.
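
As a quick sanity check on those numbers (assuming the off-diagonal storage per chiral block is packed as in the index-table demo above): a Hermitian 2Nc x 2Nc block has Nc*(2Nc-1) complex off-diagonal elements, which is 15 for Nc=3 but 28 for Nc=4, so indices up to 27 are legitimate and any buffer hard-coded to 15 elements overflows.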

But: Clover is a separate issue. Let me see what I can do about pure Wilson Nc=4 correctness. Those PRs would be good.

cpviolator commented 2 years ago

Ah yes, this was an issue I had to fix when merging master into devel. My original Nc fixes were in the llvm clover term file, then I upgraded the stack and had to fix the hardcoded instances of 3, 6, 12, and 15.

I'll push my CHROMA now. It's in a development stage for Nc agnostic nHYP force smearing, but that work is orthogonal and should not affect the clover/wilson workflow.

https://github.com/JeffersonLab/chroma/tree/feature/Nc_agnostic_hyp_force

cpviolator commented 2 years ago

Attached is a stack builder for lassen. My environment was:

Currently Loaded Modules:
  1) StdEnv (S)   2) cmake/3.20.2   3) cuda/11.4.1   4) gcc/8.3.1   5) spectrum-mpi/rolling-release

Confirmed the clover and Wilson behaviour with this stack. I cannot do the same for QDPXX because of submodules, but if you can compile a CPU version of the stack with the relevant chroma you should be able to see the differences. Let me know if you want me to hand you the QDPXX I used. compile_stack.txt

fwinter commented 2 years ago

Currently I cannot confirm that a pure Wilson HMC trajectory with Nc=4 differs between qdpjit and qdpxx. I used the HMC test included with chroma, /tests/hmc/hmc.prec_wilson.ini.xml, with modifications: tau0=0.05, nsteps=1, UpdatesThisRun=1. I tested this on a single node (1 MPI rank) with single-precision qdpxx/qdpjit built for Nc=4. How does this differ from what you tested? Can you send your input file?
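
For reference, the integrator part of that modification corresponds to XML tags like these (same tag names as in the fragment posted further down this thread; the UpdatesThisRun setting lives elsewhere in the input file and is not shown):

<MDIntegrator>
  <tau0>0.05</tau0>
  <Integrator>
    <Name>LCM_TST_LEAPFROG</Name>
    <n_steps>1</n_steps>
  </Integrator>
</MDIntegrator>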

fwinter commented 2 years ago

Ah, when there is more than one step in the trajectory, things go haywire!

cpviolator commented 2 years ago

hmc_wilson_L8T16_b10p0_kappa0p1250.ini.xml.txt hmc_clover_L8T16_b10p0_kappa0p1250.ini.xml.txt

These are the two XML scripts I'm using with QUDA. One has to turn off gauge field reconstruction in SU(Nc>3) (nearly there with that), and I use a nested FGI (force-gradient integrator).

Good call asking about pure Wilson. I'd combed the JIT clover code for days with no luck.

cpviolator commented 2 years ago

For good measure, I compiled an SU(2) stack in the hope of running into an out-of-bounds memory error. ~But a CHROMA + QDP-JIT build seems to behave the same way: a few good inversions then the solver breaks down.~ The solver seems stable for several full trajectories in SU(2), with no OOB errors.

Furthermore, the QDPXX and QDP-JIT runs for clover SU(2) are pretty much identical in solver residua and Delta H.

cpviolator commented 2 years ago

More clues. I ran SU(4) Wilson with a reversibility check set at 1. The "Delta H" values are bad, but the DeltaDeltaH values are of order 1e-8, so I'd say that there is nothing wrong with the arithmetic aspect of the SU(N) code (which explains why the first solve is always good).

This one is a doozy. I also turned off fermions and just did a simple gauge leapfrog in SU(4). I ran with tau=0.5 and 100 steps and poor Delta H appeared immediately. Then I reduced tau to 0.01 and ran again. This time, the Delta H values were good, and very close to 0, for 20-30 steps. After that, the Delta H values seemed to jump up and down by multiples of 60... independently of the lattice size.

L=16, T=32
Delta H = -0.0001074920874089
Delta H = -0.000107621308416128
Delta H = -0.000106041319668293
Delta H = -0.000105856684967875
Delta H = -0.000105672515928745
Delta H = -0.000106114661321044
Delta H = -0.00010473863221705
Delta H = -0.000105655053630471
Delta H = -0.000104928389191628
Delta H = -0.000105801736935973
Delta H = -0.000103999627754092
Delta H = -0.000103744445368648
Delta H = -0.000104400096461177
Delta H = -0.000103401020169258
Delta H = -0.000103696715086699
Delta H = -0.000103837344795465
Delta H = -0.000103556551039219
Delta H = -0.000102093443274498
Delta H = 60.1784195967484
Delta H = -0.0605068339500576
Delta H = 239.284650662448
Delta H = 0.192462589824572
Delta H = 120.200147058116
Delta H = 59.8451581248082
Delta H = 239.613046649378
Delta H = 238.514666688163
Delta H = 120.201233078958
L=8, T=16
Delta H = -6.77519710734487e-06
Delta H = -6.60921796225011e-06
Delta H = -6.91338209435344e-06
Delta H = -6.30853173788637e-06
Delta H = -6.53583265375346e-06
Delta H = -6.39547943137586e-06
Delta H = -6.60271325614303e-06
Delta H = -6.62061211187392e-06
Delta H = -6.77235948387533e-06
Delta H = -6.75035698805004e-06
Delta H = -6.45926047582179e-06
Delta H = -6.56952033750713e-06
Delta H = -6.50600122753531e-06
Delta H = -6.14863529335707e-06
Delta H = -6.40802318230271e-06
Delta H = -6.55952317174524e-06
Delta H = -6.61509693600237e-06
Delta H = -6.40650978311896e-06
Delta H = -6.21813524048775e-06
Delta H = -6.25410757493228e-06
Delta H = -6.1643950175494e-06
Delta H = -5.88547845836729e-06
Delta H = -6.64479739498347e-06
Delta H = -6.16365286987275e-06
Delta H = -6.08709524385631e-06
Delta H = 59.9789676625951
Delta H = -59.9594243784668
Delta H = 59.7502840727539
Delta H = 59.8389409547817
Delta H = 59.5600050108915
Delta H = 239.439682095748
Delta H = 119.22457449291
Delta H = 59.6712494555977
Delta H = -119.634477368687
cpviolator commented 2 years ago

I changed the seeds for a few runs at SU(4) and the behaviour is very similar: jumps at the same trajectory with a slightly different value. I also changed the MPI geometry, to no avail. However, when I changed beta the jump value changed. I'm seeing that the jump value is about 5.9*beta*n, i.e., at beta=10.2 the jumps were multiples of about 60. At beta=3.2 it was about 18, and at beta=15 it was around 89. 5.9*beta is suspiciously close to 2*(Nc=3)*beta.
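
As a quick arithmetic check on that reading of the numbers: 5.9*beta gives 60.2, 18.9, and 88.5 for beta = 10.2, 3.2, and 15 respectively, consistent with the observed ~60, ~18, and ~89, whereas a prefactor of 2*Nc*beta with the correct Nc=4 would give 81.6, 25.6, and 120.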

I suspected some sort of reunitarisation failure, so I upped the order of the SU(N) Taylor exponentiation to 20 and added a reunit step of the staple term after exponentiation. This did not change the numbers or the behaviour appreciably at all.

Put together: the behaviour above is independent of MPI geometry, lattice size, and Nc, depends on beta, and is only visible for very short trajectories with epsilon around 1e-4. For larger epsilon the Delta H is always bad, and for epsilon < 1e-4 the behaviour takes longer to appear. I'd like to buy a vowel...

For reproducibility, this is the adjustment I made to my script to run gauge-only and omit the fermions:

<Hamiltonian>
  <monomial_ids>
    <!--elem>wilson_two_flav_1</elem-->
    <!--elem>wilson_two_flav_2</elem-->
    <elem>gauge</elem>
  </monomial_ids>
</Hamiltonian>
<MDIntegrator>
  <tau0>0.01</tau0>
  <Integrator>
    <Name>LCM_TST_LEAPFROG</Name>
    <n_steps>100</n_steps>
    <monomial_ids>
      <!--elem>wilson_two_flav_1</elem-->
      <!--elem>wilson_two_flav_2</elem-->
      <elem>gauge</elem>
    </monomial_ids>
    <!--SubIntegrator>
      <Name>LCM_STS_FORCE_GRAD</Name>
      <n_steps>4</n_steps>
      <monomial_ids>
        <elem>gauge</elem>
      </monomial_ids>
    </SubIntegrator-->
  </Integrator>
</MDIntegrator>
cpviolator commented 2 years ago

OK, I adjusted the number of leapfrog steps from 100 to 50, then 25, and the jumps happened at the same trajectory. However, when I adjusted tau to 0.02 and attempted similar trajectory lengths, the jump behaviour happened sooner.

So tau itself, not the number of steps or the step size, is the culprit. I've inspected the XMLLOG files from identical QDPXX and QDPJIT runs and there is a mismatch in the deltaPE values, but not the deltaKE values, at the offending trajectory. I added (Wilson) fermions to these very short trajectories and they do not seem to affect the behaviour: the badness occurs at the same trajectory. It's a gauge thing.

fwinter commented 2 years ago

After a whole day of debugging I finally got to the root of this. There's a problem with one of the QDP operators. It's not an easy fix. Working on it...

fwinter commented 2 years ago

Not sure how to fix this.

In the meantime I pushed a work-around. I've tested it for Nc=4. It seems to work, at least for pure Wilson.

EDIT: Actually, this might not be a bug in QDP-JIT. I think the where function should be used with QDP types, not plain ints. The reunit function was buggy for Nc>3; my commit fixed it.
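
To illustrate the class of problem being described, here is a plain-C++ analogy (not the QDP++ API and not the actual qdp-jit/chroma fix): when an element-wise where() is overloaded, passing plain int literals instead of the field types the library expects can silently route the call through a different code path than intended.

#include <cstddef>
#include <iostream>
#include <vector>

// stand-in for a lattice field type
struct Field {
  std::vector<double> v;
  explicit Field(double x) : v(4, x) {}
};

// element-wise select: the overload the caller usually wants
Field where(const std::vector<bool> &mask, const Field &a, const Field &b) {
  Field r(0.0);
  for (std::size_t i = 0; i < mask.size(); ++i) r.v[i] = mask[i] ? a.v[i] : b.v[i];
  return r;
}

// a deliberately "buggy" scalar overload that plain ints bind to:
// it ignores the mask and broadcasts a single value
Field where(const std::vector<bool> &mask, int a, int /*b*/) {
  (void)mask;
  return Field(a);
}

int main() {
  std::vector<bool> mask = {true, false, true, false};
  Field good = where(mask, Field(1.0), Field(-1.0)); // element-wise, as intended
  Field bad  = where(mask, 1, -1);                   // plain ints pick the wrong overload
  for (double x : good.v) std::cout << x << ' ';     // prints: 1 -1 1 -1
  std::cout << '\n';
  for (double x : bad.v)  std::cout << x << ' ';     // prints: 1 1 1 1
  std::cout << '\n';
}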

cpviolator commented 2 years ago

Nicely done @fwinter, that where fix would have taken me weeks I'm sure! I pulled all the latest commits and I'm rebuilding for Nc=2,3,4,5 HMC. Will post the results on solver and Delta H behaviour.

cpviolator commented 2 years ago

Confirming that the 2 flavor HMC with QDP-JIT + QUDA clover (no smearing) and the QDPXX builds for SU(4) agree beautifully in Delta H and, naturally, solver residual. The speed-up over the pure 32 core CPU stack is about 19x at L=16, T=32 with 4 GPUs employed. I'm sure there is more room in there too!

Thank you so much for your help @fwinter . I'll leave it to you to close the issue when you're satisfied.

fwinter commented 2 years ago

Great! 16^3x32 is a small volume for 4 GPUs to work on. At the end of the output you'll find the peak GPU memory allocation. That's probably very small compared to the memory pool size stated at the beginning. When you go to larger lattices you can keep an eye on the number of lattice objects copied to host memory. If that begins to rise to a large number, it means memory paging to CPU memory was going on, and it's time to increase the number of GPUs in the job.

cpviolator commented 2 years ago

Yeah, I was working with L=16 and 4 nodes just to make sure MPI was working. In production we try to saturate the device memory as much as possible. With QUDA, there are some neat envars one can define to toggle between using PINNED memory and MANAGED memory, and that can help with memory management,

export QUDA_ENABLE_DEVICE_MEMORY_POOL=0
export QUDA_ENABLE_PINNED_MEMORY_POOL=0

which instruct QUDA to delete memory as soon as it is no longer needed. You pay a price for repeated malloc and free, but eviction of device memory to the host is reduced.

How does one find out how many CPU paging events occur in JIT?

cpviolator commented 2 years ago

While we're here, let me know if you'd like me to point out some further Nc tweaks in qdp-jit (and qdpxx). One or two places are still hardcoded for 3 but could be made to work for general Nc. I did see one template hardcoded to 3:

https://github.com/JeffersonLab/qdp-jit/blob/37437739aa2d194eefc5a468e03749afea7f8906/include/qdp_sum.h#L309

and the hardcoded 3 and 2 su3 array sizes in lib/qdp_scalarvec_specific.cc, lib/qdp_parscalar_specific.cc, and lib/qdp_scalar_specific.cc, which could be made at least partially Nc-agnostic. And I'm pretty sure I can drum up some CMake for the LLVM installation if you're still interested.

fwinter commented 2 years ago

Yes, it seems these locations can be generalized to Nc != 3.

The following numbers printed at the end of the run tell the total number of lattice fields being moved. This includes the ones for I/O and paging. For HMC in 4 dimensions the lowest number should be 4 for both. Anything higher is likely due to paging.

Memory cache
  lattice objects copied to host memory: 4
  lattice objects copied to device memory: 4

Experience shows that a small amount of paging has only a small impact on overall wall-clock time.

On another note: qdp-jit has support for CUDA-aware MPI. If your compiler/MPI wrapper supports it, it must also be enabled in qdp-jit via a runtime switch (-gpudirect). With this and the built-in overlapping comms, the performance of a vanilla Chroma solver (CG or BiCGStab) shouldn't be bad. Of course, there's no MG in qdp-jit.
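
For example, a launch on lassen might then look something like the line below. Only -gpudirect is taken from the comment above; the jsrun options and chroma's -i/-geom flags are assumptions based on typical usage and should be adapted to your job:

jsrun -n 4 -g 1 ./chroma -i hmc_wilson_L8T16_b10p0_kappa0p1250.ini.xml -geom 1 1 2 2 -gpudirect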