HISKP-LQCD / sLapH-contractions

Stochastic LapH contraction program
GNU General Public License v3.0

Issues with cA2.09.48 on JUWELS #109

Closed martin-ueding closed 4 years ago

martin-ueding commented 4 years ago

For lack of a better place, I am putting this here. I am trying to run the contractions for cA2.09.48 on JUWELS. I got it compiled, and I think that I have set it up correctly via the input file. The paths might be wrong, as the perambulators are not in one subdirectory per random vector. I get the following:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 19) >= this->size() (which is 19)
kostrzewa commented 4 years ago

Yeah, one of the inconsistencies... Should be easy enough to account for -> https://github.com/HISKP-LQCD/sLapH-contractions/blob/master/src/global_data_build_IO_names.cpp

martin-ueding commented 4 years ago

Oh, you know where this hits?

kostrzewa commented 4 years ago

I would have expected a perambulator or random vector reading failure, however...

martin-ueding commented 4 years ago

Guess I will be looking into the traceback stuff that we had discussed a few times.

kostrzewa commented 4 years ago

The IO names are constructed in src/global_data_build_IO_names.cpp. At the start of contract (or VdaggerV), build_IO_names(gd, config_i) is called, and this populates the various members:

gd.rnd_vec_construct.filename_list
gd.peram_construct.filename_list
gd.filename_eigenvectors

Just to be sure:

1) You did not reorganize the perambulators into the expected directory structure, right?
2) You chose appropriate dilution sizes for this lattice?

kostrzewa commented 4 years ago

I've answered my second question:

quark = u:6:TB:2:EI:6:DF:4:/p/scratch/chbn28/project/sLapH_perambulators/cA2a.09.48/light

should be

quark = u:6:TB:3:EI:4:DF:4:/p/scratch/chbn28/project/sLapH_perambulators/cA2a.09.48/light
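For reference, the quark line is just a colon-separated record, so a consistency check could split it mechanically. A minimal sketch (the field meanings — flavour, number of random vectors, dilution scheme as key/value pairs, path — are my reading of the examples above, not taken from the actual input parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Mechanically split the colon-separated quark specification into fields.
// The path has to come last, since it contains no colons itself.
std::vector<std::string> split_quark_spec(std::string const &spec) {
  std::vector<std::string> fields;
  std::istringstream stream(spec);
  std::string field;
  while (std::getline(stream, field, ':')) {
    fields.push_back(field);
  }
  return fields;
}
```

Splitting the corrected line gives TB = 3 in field 3 and EI = 4 in field 5, which is what a consistency check against the perambulator file names would compare.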
kostrzewa commented 4 years ago

Guess I will be looking into the traceback stuff that we had discussed a few times.

Indeed, it would be useful to know where this was triggered without having to fire up a debugger...

martin-ueding commented 4 years ago

I had tried to compile the software with the -traceback flag for the ICC, but it did not magically print the backtrace.

I'm back from lunch and I got this backtrace from gdb:

#0  0x00002aaaabe76207 in raise () from /lib64/libc.so.6
#1  0x00002aaaabe778f8 in abort () from /lib64/libc.so.6
#2  0x00002aaaabd45993 in __gnu_cxx::__verbose_terminate_handler ()
    at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00002aaaabd4bb26 in __cxxabiv1::__terminate(void (*)()) ()
    at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00002aaaabd4bb61 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00002aaaabd4bd93 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x12e3b60, 
    tinfo=0x2aaaabe31998 <typeinfo for std::out_of_range>, dest=0x2aaaabd60a20 <std::out_of_range::~out_of_range()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x00002aaaabd478a1 in std::__throw_out_of_range_fmt (__fmt=<optimized out>)
    from /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/lib64/libstdc++.so.6
#7  0x0000000000815b73 in std::vector<DilutedFactorIndex, std::allocator<DilutedFactorIndex> >::_M_range_check (
    this=0x12d7b70, __n=19)
    at /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/include/c++/8.3.0/bits/stl_vector.h:960
#8  0x0000000000815bab in std::vector<DilutedFactorIndex, std::allocator<DilutedFactorIndex> >::at (this=0x12d7b70, 
    __n=19) at /gpfs/software/juwels/stages/2019a/software/GCCcore/8.3.0/include/c++/8.3.0/bits/stl_vector.h:981
#9  0x00000000008082a3 in read_parameters (gd=..., ac=7, av=0x7ffffffeeeb8)
    at /p/project/chbn28/ueding1/Code/sLapH-contractions/src/global_data.cpp:393
#10 0x0000000000710eb8 in main (ac=7, av=0x7ffffffeeeb8)
    at /p/project/chbn28/ueding1/Code/sLapH-contractions/main/contract.cpp:41

The offending line (global_data.cpp:393) is this one:

      auto const size2 = gd.quarkline_lookuptable.at("Q2").at(q[1]).rnd_vec_ids.size();

The type of that field is this:

  DilutedFactorIndicesCollection quarkline_lookuptable;

And that type is:

using DilutedFactorIndicesCollection =
    std::map<std::string, std::vector<DilutedFactorIndex>>;

So it is the second .at that fails the range check. The q come from TraceIndicesCollection trace_indices_map, where TraceIndicesCollection is std::map<std::string, std::vector<Indices>>. So each q is an Indices, which is just a std::vector<ssize_t>.
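The failure mode can be reproduced in isolation: with the same map-of-vectors layout, the second .at() is the one that range-checks. A minimal stand-in (int replaces DilutedFactorIndex here, purely for illustration):

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified stand-in for the failing access pattern: the outer .at() finds
// the "Q2" key fine, but the inner .at() range-checks the vector index and
// throws std::out_of_range when the lookup table has fewer entries than the
// trace specification asks for.
using LookupTable = std::map<std::string, std::vector<int>>;

bool access_throws(LookupTable const &table, std::string const &key, std::size_t index) {
  try {
    table.at(key).at(index);
    return false;
  } catch (std::out_of_range const &) {
    return true;
  }
}
```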

It could be the dilution specification, I will check that.

martin-ueding commented 4 years ago

I have updated the dilution specification; it shows up in the logs, but there is still the same exception. I'll run it in GDB again to make sure that it is the same spot.

kostrzewa commented 4 years ago

The offending line (global_data.cpp:393) is this one:

Did you add something to the function? We seem to have different line numbers (Q0Q2-optimization branch):

[...]
389     int total_number_of_random_combinations_in_trQ1 = 0;
390     for (auto const &q : gd.trace_indices_map.at("trQ1")) {
391       total_number_of_random_combinations_in_trQ1 +=
392           gd.quarkline_lookuptable.at("Q1").at(q[0]).rnd_vec_ids.size();
393     }                                                                                                                                                    
[...]
martin-ueding commented 4 years ago

I have moved the perambulators into the same layout as on QBIG, but that also did not resolve the issue:

./cnfg0240/rnd_vec_00/perambulator.rndvecnb00.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_00/randomvector.rndvecnb00.u.nbev0660.0240
./cnfg0240/rnd_vec_01/perambulator.rndvecnb01.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_01/randomvector.rndvecnb01.u.nbev0660.0240
./cnfg0240/rnd_vec_02/perambulator.rndvecnb02.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_02/randomvector.rndvecnb02.u.nbev0660.0240
./cnfg0240/rnd_vec_03/perambulator.rndvecnb03.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_03/randomvector.rndvecnb03.u.nbev0660.0240
./cnfg0240/rnd_vec_04/perambulator.rndvecnb04.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_04/randomvector.rndvecnb04.u.nbev0660.0240
./cnfg0240/rnd_vec_05/perambulator.rndvecnb05.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00240
./cnfg0240/rnd_vec_05/randomvector.rndvecnb05.u.nbev0660.0240
pittlerf commented 4 years ago

Are you sure that it is connected to the perambulators? I remember having an issue where the sign of the required momentum in the VdaggerV was different from what was actually written to disk. Could that be the case here?

martin-ueding commented 4 years ago

It could also be that I have set up the VdaggerV incorrectly. I have never used pre-generated VdaggerV, so perhaps I am doing it wrong.

So in /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48 I have this structure:

cnfg0240/operators.0240.p_00-1.d_000.t_000
cnfg0240/operators.0240.p_00-1.d_000.t_001
cnfg0240/operators.0240.p_00-1.d_000.t_002
cnfg0240/operators.0240.p_00-1.d_000.t_003
cnfg0240/operators.0240.p_00-1.d_000.t_004
cnfg0240/operators.0240.p_00-1.d_000.t_005
cnfg0240/operators.0240.p_00-1.d_000.t_006
cnfg0240/operators.0240.p_00-1.d_000.t_007
cnfg0240/operators.0240.p_00-1.d_000.t_008
cnfg0240/operators.0240.p_00-1.d_000.t_009

And then in the file I have these options:

delta_config = 1
#path_config = /hiskp4/gauges/quenched/wilson_b5.85_L12T24

# eigenvector handling:
number_of_eigen_vec = 660
#path_eigenvectors   = /hiskp4/eigensystems/nf211/A40.24/
#name_eigenvectors   = eigenvectors

handling_vdaggerv   = read
path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240

From the code I have these snippets that make up the file names:

  auto const full_path =
      (boost::format("/%s/cnfg%04d/operators.%04d") % path_vdaggerv % config % config)
          .str();

  std::string dummy = full_path + ".p_" + std::to_string(op.momentum[0]) +
                      std::to_string(op.momentum[1]) +
                      std::to_string(op.momentum[2]) + ".d_" +
                      to_string(op.displacement);

  auto const infile = (boost::format("%s.t_%03d") % dummy % t).str();

If these files could not be opened, errors would be thrown. So either they are being built or they are being found.

martin-ueding commented 4 years ago

Let's try that without the configuration number in the path!

path_vdaggerv = /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48

But that on the other hand means that there is no clear error when the file cannot be found!

martin-ueding commented 4 years ago

It seems that the path to the perambulators is not the real problem. Taking the output from some working contraction we can see what happens:

Memory consumption:
        OperatorFactory:        0.18 Gb
        RandomVector:   0.00 Gb
        Perambulator:   1.19 Gb
        DiagramParts:
                Q0:     0.10 Gb
                Q1:     0.00 Gb
                Q2:     0.10 Gb
        trQ1Q1: 0.00 Gb

And this is missing in the new case here:

        trQ0Q2: 17179869153.22 Gb
        trQ1:   0.00 Gb
        Diagrams:
        Perambulators initialised
        Random vectors initialised
        Meson operators initialised

processing configuration: 1430

[Start    ] Perambulator I/O
        Reading perambulator from file:
                /hiskp4/perambulators/nf211/A40.24/light/cnfg1430/rnd_vec_00/perambulator.rndvecnb00.u.TsoB0024.VsoI0006.DsoF4.TsiF00
[   Finish] Perambulator I/O. Duration: 0.7837574035 seconds. Threads: 1
…
[Start    ] Eigenvector and Gauge I/O
        Reading eigenvectors from files:/hiskp4/eigensystems/nf211/A40.24//eigenvectors.1430.000
…

So the actual reading of the files happens later on. It seems that it would crash regardless of whether the actual files are present. I am not sure how that works with the VdaggerV “read” mode here; the other contractions were in “build” mode.

I will just try to run it with “build” and see whether that crashes later.

pittlerf commented 4 years ago

Hi Martin,

The VdaggerV reading happens after the perambulator reading:

     Meson operators initialised

processing configuration: 468

[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_00/perambu
lator.rndvecnb00.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 3.9700660706 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_01/perambu
lator.rndvecnb01.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1294889450 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_02/perambu
lator.rndvecnb02.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1566550732 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_03/perambu
lator.rndvecnb03.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 5.1603319645 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_04/perambulator.rndvecnb04.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 4.8916890621 seconds. Threads: 1
[Start    ] Perambulator I/O
        Reading perambulator from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_05/perambulator.rndvecnb05.u.TsoB0032.VsoI0004.DsoF4.TsiF0096.SsiF110592.DsiF4.CsiF3.smeared0.00468
[   Finish] Perambulator I/O. Duration: 4.8554379940 seconds. Threads: 1
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_00/randomvector.rndvecnb00.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_01/randomvector.rndvecnb01.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_02/randomvector.rndvecnb02.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_03/randomvector.rndvecnb03.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_04/randomvector.rndvecnb04.u.nbev0660.0468
        Reading random vector from file:
                /p/scratch/chbn28/hbn28d/peram_generation/nf2/cA2.09.48/light/cnfg0468/rnd_vec_05/randomvector.rndvecnb05.u.nbev0660.0468
[Start    ] VdaggerV I/O
        reading VdaggerV from file://p/scratch/chbn28/hbn28d/vdaggerv/nf2/cA2.09.48/data//cnfg0468/operators.0468.p_-100.d_000.t_009
        reading VdaggerV from file://p/scratch/chbn28/hbn28d/vdaggerv/nf2/cA2.09.48/data//cnfg0468/operators.0468.p_-100.d_000.t_006
pittlerf commented 4 years ago

at least in the case of the rho

martin-ueding commented 4 years ago

Okay, thank you! This means that something apart from the files is still wrong. Okay, I will try to figure that out.

pittlerf commented 4 years ago

I have tried your code with the three-pion and the rho input files as well; the three-pion one crashed with the same error and the rho one actually did not crash.

Shell debugging restarted
This is sLapH-contractions:
  git branch refs/heads/Q0Q2-optimization
  git revision 426b708b50927b22428ec50958aa104bbff2ab4a
  git state CLEAN
  compiled by pittler1 on juwels08.ib.juwels.fzj.de
  running with up to 24 OpenMP threads
pittlerf commented 4 years ago

I have tested that with

operator_list = g5.d0.p0

it works.

martin-ueding commented 4 years ago

Okay, thank you very much for testing this! That means that it is independent of any perambulator or VdaggerV files but there is something wrong prior to that. I will try to triage that further.

martin-ueding commented 4 years ago

I just realized that you were using a somewhat older version of the code from 2019-07-22. Since then a few things have happened:

$ git diff 426b708b50927b22428ec50958aa104bbff2ab4a.. --stat
 CMakeLists.txt                                                  |  26 ++---
 deprecated/VdaggerV.cpp                                         | 253 +++++++++++++++++++++++++++++++++++++++++++
 include/EigenVector.hpp                                         |   4 +
 include/OperatorsForMesons.hpp                                  |   9 +-
 include/global_data.hpp                                         |   5 +
 include/h5-wrapper.hpp                                          |   2 +-
 include/timings.hpp                                             |  53 +++++----
 job-script-generator/generate-contraction-jobs                  |   6 +-
 job-script-generator/job_script_juwels.sh.j2                    |  38 +++++++
 job-script-generator/job_script_qbig_slurm.sh.j2                |   2 +-
 job-script-generator/kill_runs.sh                               |   8 --
 main/VdaggerV.cpp                                               | 275 ++++++++++-------------------------------------
 main/contract.cpp                                               |  26 +++--
 make-jureca-build.sh                                            |  44 --------
 make-jurecabooster-build.sh                                     |  46 --------
 make-qbig-build.sh                                              |  31 ------
 migrate-plus                                                    |  82 --------------
 src/Correlators.cpp                                             |   2 +-
 src/DilutedFactorFactory.cpp                                    |   1 -
 src/DilutedTraceFactory.cpp                                     |  22 ++--
 src/EigenVector.cpp                                             |  49 ++++++++-
 src/OperatorFactory.cpp                                         | 234 +++++++++++++++++++++++++++++++---------
 src/global_data.cpp                                             | 176 ++++++++++++++++++------------
 src/init_lookup_tables.cpp                                      |   4 +
 tests/integration-L4-new/correlators-reference/C6cC_cnfg1000.h5 | Bin 195608 -> 195608 bytes
 tests/integration-L4/correlators-reference/C6cC_cnfg1000.h5     | Bin 195608 -> 195608 bytes
 utility/make-qbig-build.sh                                      |   6 +-
 utility/module_load_juwels_icc.sh                               |   8 ++
 utility/reverse-time-in-C6cC                                    |  79 ++++++++++++++
 29 files changed, 862 insertions(+), 629 deletions(-)

I have tried it with your version and it also crashes. So I would say that the error has been there all along. On my laptop it takes 384 seconds until the program crashes. I realized that I did not have the cutoffs in place; then all of a sudden it crashed after 6 seconds. The range check now fails because it hits index 7, but the same principle applies.

Just running it with operator_list = g5.d0.p0,1 works. More momenta do not work. On my laptop it tried to allocate the memory, and unfortunately I have a swap partition, so it succeeded 🙄.

Okay, then we have it pinned down pretty well, I believe.

martin-ueding commented 4 years ago

These combinations work:

momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1

momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2

But these do not:

momentum_cutoff_0 = 2
operator_list = g5.d0.p0,1,2

momentum_cutoff_0 = 4
operator_list = g5.d0.p0,1,2,3

This should not be the case, of course. The cutoffs should just remove all the operators that are not needed. So perhaps the problem is that some operators are not used at all with the given cutoffs and therefore do not appear in some list, letting the indexing run past the end of the vector.

I'm reaching the conclusion that I do not want to do anything with that code and just outright replace it now.

kostrzewa commented 4 years ago

Looking just at your output file on JUWELS for guidance,

[...]
    gd.path_config: 
    gd.handling_vdaggerv: read
    gd.path_vdaggerv: /p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48
  Momentum Cutoff: 0->4 1->5 2->6 3->7 4->4 

Memory consumption:
        OperatorFactory:        10.59 Gb
        RandomVector:   0.02 Gb
        Perambulator:   11.60 Gb
        DiagramParts:
                Q0:     0.12 Gb
                Q1:     0.00 Gb
                Q2:     0.07 Gb
        trQ1Q1: 0.00 Gb
Di 15. Okt 16:46:41 CEST 2019

It seems to me that the problem occurs in the iteration over the "quantum number indices" for trQ0Q2 accessing the indices of quark lines "Q0" or "Q2". Looking at the corresponding lines of global_data.cpp (around line 380), this seems to be the spot where things go wrong, hence my question above regarding line numbers in your backtrace.

I'm reaching the conclusion that I do not want to do anything with that code and just outright replace it now.

Yes, although one should keep in mind that all the lookup table stuff in the code is solving a rather complicated organisational problem and that replacing it is similarly not so simple. Going from the output of the subduction code back to VdaggerV objects and perambulators is not totally straightforward, I think.... Also, one doesn't just have to generate the required dependencies on these basic objects at this stage, but also on intermediate results such as Q0Q2.

As it is right now, it seems that objects are requested which are not added to some list at an appropriate stage, most likely in the construction of the dependencies for trQ0Q2, which themselves come from C4cC and C6cC, correct?

If a complete redesign is not feasible, one might be able to reshape the construction of the lookup tables by making use of maps with clear names for each required object instead of having to manually ensure unique indexing (as exemplified by the requirement of the unique_push_back used all over the place). Having unique (and human-readable) object names would also mean that it would be completely clear at this stage which object is missing, perhaps even allowing to quickly identify why this is so. Instead, all the information we have right now is an index number ...
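A minimal sketch of what string-keyed lookups could look like (the QuarklineSpec fields and the key format are hypothetical, purely to illustrate the error reporting):

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Entries are stored under readable names instead of manually maintained
// integer indices, so a missing entry produces a self-explanatory error.
struct QuarklineSpec {
  int momentum_squared;
  std::string gamma;
};

class NamedLookup {
 public:
  void add(std::string const &name, QuarklineSpec spec) { entries_[name] = spec; }

  QuarklineSpec const &get(std::string const &name) const {
    auto const it = entries_.find(name);
    if (it == entries_.end()) {
      // Unlike a bare index, the name tells us immediately what is missing.
      throw std::runtime_error("Missing lookup entry: " + name);
    }
    return it->second;
  }

 private:
  std::map<std::string, QuarklineSpec> entries_;
};
```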

martin-ueding commented 4 years ago

Sorry, I did not see that comment before.

Did you add something to the function? We seem to have different line numbers (Q0Q2-optimization branch):

Yes, a simple print statement which I later removed again.

The spot that you name certainly is the one where the code crashes.

Yes, although one should keep in mind that all the lookup table stuff in the code is solving a rather complicated organisational problem and that replacing it is similarly not so simple. Going from the output of the subduction code back to VdaggerV objects and perambulators is not totally straightforward, I think...

I still have not dug into it too deeply, so I likely underestimate the effort required. But in principle the correlators have all the information in them. Also I have the diagram specifications that tells me which Q0/Q1/Q2 objects are involved at which positions.

  map["C6cCD"] = {Vertices({0, 2, 4}, {1, 3, 5}),
                  TraceSpecs{{{"Q2", 3, 0}, {"Q0", 0, 1}, {"Q2", 1, 2}, {"Q0", 2, 3}},
                             {{"Q0", 4, 5}, {"Q2", 5, 4}}}};

Here you see that we have a Q2Q0Q2Q0 involving the vertices [0, 1, 2, 3] and another Q0Q2 involving the vertices [4, 5]. Then I chunk the HDF5 name and pick out the required momenta and gamma structures. I can also see whether they are source and sink indices and take into account that I might need to negate the momentum.

Using this I should be able to populate the operators, the Q objects and the trace(Q…) objects that are needed bottom up. And then use the indices from these to fill the index tables of the more complex objects.

kostrzewa commented 4 years ago

I still have not dug into it too deeply, so I likely underestimate the effort required. But in principle the correlators have all the information in them. Also I have the diagram specifications that tells me which Q0/Q1/Q2 objects are involved at which positions.

I agree, the diagram spec should provide the necessary handle.

kostrzewa commented 4 years ago

Once the json based correlator input is merged, what's (roughly) the modus operandi if I wanted to run contractions for some new (or old) system?

martin-ueding commented 4 years ago

For a new system you would run my projection code to obtain the prescriptions. The contraction code contains a utility script which goes through all these prescriptions, extracts the dataset names, and compiles them into the list of correlator names.

For an existing system, like the integration tests, I used h5ls and then created the file myself using some Vim commands. But also here it would be easy to write a script which does this for you. I can write that eventually, but skipped that for now.

kostrzewa commented 4 years ago

For a new system you would run my projection code to obtain the prescriptions. The contraction code contains a utility script which goes through all these prescriptions, extracts the dataset names, and compiles them into the list of correlator names.

What if I don't have Mathematica? Just playing devil's advocate here to understand all steps.

What about the fact that QCT does not have real support for the twisted mass symmetries, does that matter?

kostrzewa commented 4 years ago

Supporting new types of contractions then necessarily also means having to touch the subduction code, correct?

martin-ueding commented 4 years ago

A third option I forgot is to just generate such a list of diagram names with some other means. Basically the part of the code that I removed (and had some bug in it) could be rewritten in some other language to output such a list.

What if I don't have Mathematica? Just playing devil's advocate here to understand all steps.

You could just personally buy an academically discounted version if your institution does not have it. I mean this has happened with Maple here once already.

But if you don't have Mathematica, you cannot use the projection code at all. This means you would need to either acquire a license or rewrite all of it in some other product.

Even if you use the old code and just let it generate a bunch of correlation functions, you will need to know how these are to be combined. And once you have figured that out, you can also build such a list. Unless you write a code which never explicitly constructs this list but uses the numeric data too soon.

What about the fact that QCT does not have real support for the twisted mass symmetries, does that matter?

It does not know anything about the Dirac matrices. I explicitly do the gamma-5 trick with pattern matching in the code. But there might be more opportunities to compute fewer correlators.

Supporting new types of contractions then necessarily also means having to touch the subduction code, correct?

I would rather say that once you have new operators you need to touch the projection code. The fact that you get different contractions from that is just a consequence of that.


But you are right, the projection code becomes somewhat of a dependency to the contraction code. On the other hand there is a clear interface, the JSON file with the correlator names, so one could substitute it with a different implementation.

kostrzewa commented 4 years ago

Thanks for the detailed explanation.

I would rather say that once you have new operators you need to touch the projection code. The fact that you get different contractions from that is just a consequence of that.

Yes, that what I meant.

A third option I forgot is to just generate such a list of diagram names with some other means. Basically the part of the code that I removed (and had some bug in it) could be rewritten in some other language to output such a list.

For simple systems (i.e., stuff at zero momentum basically) this will be necessary to have I believe. I agree that there's a clear interface now and it should be relatively straightforward to bang something together.

martin-ueding commented 4 years ago

For simple systems (i.e., stuff at zero momentum basically) this will be necessary to have I believe.

Well, for zero momentum you can just write them by hand. There is just one correlator per diagram type.


On JUWELS I advanced further, now I am at this error message:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Can't open //p/scratch/chbn28/project/vdaggerv/nf2/cA2.09.48/cnfg0240/operators.0240.p_011.d_000.t_003

Taking a look into that directory shows that not all momenta are actually built.

[ueding1 juwels03 .../vdaggerv/nf2/cA2.09.48/cnfg0240]
$ ls | cut -d . -f 3 | sort -u
p_00-1
p_00-2
p_0-10
p_0-11
p_0-1-1
p_0-20
p_-100
p_-101
p_-10-1
p_-110
p_-1-10
p_-111
p_-1-11
p_-11-1
p_-1-1-1
p_-200

The one that I want, (0, 1, 1), is not available. But there is the conjugate, (0, -1, -1), which I could use. I presume that the conjugation of VdaggerV objects has not been resolved in the same fashion here as it was for the generation of these files.

If a file cannot be found, I could take the opposite-parity file and apply a hermitian conjugation. Or, when resolving which VdaggerV I need, I could make sure that I always have a minus sign in the first non-zero momentum component. That is likely the easier method and will use a bit less memory.
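A sketch of the second option, assuming the convention visible in the directory listing above (first non-zero component negative); the returned flag marks whether the object read from disk has to be conjugated:

```cpp
#include <array>
#include <utility>

// Normalize the requested momentum so that its first non-zero component is
// negative. The flag reports whether the VdaggerV object read from disk must
// be hermitian-conjugated to recover the originally requested momentum.
std::pair<std::array<int, 3>, bool> canonicalize_momentum(std::array<int, 3> p) {
  for (int const component : p) {
    if (component == 0) {
      continue;
    }
    if (component > 0) {
      // Flip the whole vector and remember to conjugate on read.
      for (int &c : p) {
        c = -c;
      }
      return {p, true};
    }
    break;  // first non-zero component already negative
  }
  return {p, false};
}
```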

kostrzewa commented 4 years ago

Well, for zero momentum you can just write them by hand. There is just one correlator per diagram type.

Yes, I agree completely, this is a very good input interface for the purpose, generalizing straightforwardly to multiple flavours and simple enough to write by hand for simple cases.

This approach is on the way towards a more general "diagram description language" which I thought might have been optimal for this type of rather general contraction code. At the time, I had the idea of a slightly more elaborate format which also included different modifiers for the quark lines (to support multiple smearing types, for example) and with an explicit listing of temporal and spatial coordinates, providing the ordering that is now provided by the diagram specifications internally. I dropped the idea to pursue this as there seemed to be no interest in the group at the time.

The one that I want (0, 1, 1) is not available. But there is the conjugated (0, -1, -1) which I could use. I presume that somehow the conjugation of VdaggerV objects has not been resolved in the same fashion as it has been for the generation of these files.

Yes, there is no way around checking both the +p and -p VdaggerV files, taking the one that exists and tagging whether it needs to be conjugated when used.

kostrzewa commented 4 years ago

Yes, there is no way around checking both the +p and -p VdaggerV files, taking the one that exists and tagging whether it needs to be conjugated when used.

It might make sense to create a map of available VdaggerV objects at startup (when in "read" mode) in order to avoid overly frequent requests to the file system.

kostrzewa commented 4 years ago

On JUWELS I advanced further, now I am at this error message:

at least it throws now :)

martin-ueding commented 4 years ago

Okay, we are all set to tackle the next problem 😄😫. The code reads in the perambulators and the VdaggerV (thanks for your help in the meeting!), and it starts the contractions:

[Start    ] All contractions
Thread   0 of   4 starts with block pair     1 of   528.
Thread   2 of   4 starts with block pair     0 of   528.
Thread   3 of   4 starts with block pair     2 of   528.
Thread   1 of   4 starts with block pair     3 of   528.

But then, of course, as we all guessed, it did not get any further:

srun: error: jwc00n001: task 0: Killed
srun: Force Terminated job step 1739254.0

I noticed that I did not ask for any memory explicitly, so I have no idea how much I did get.

martin-ueding commented 4 years ago

I guess I got the full 90 GB. Logged into the node and using watch free -h, I could see how the memory got consumed and the job killed. And I was only running 4 threads! On this machine we should be running with 48 threads.

We likely waste a bunch of memory in the final reduction; I wanted to switch that over to atomics. But even then, the per-thread cache is just an idiotic design for machines with this core-to-memory ratio. I see these options, none of them great:

  1. Use atomics for the reduction, thereby reducing the memory load, and then it will magically fit into memory with four threads. We will still waste almost all of the computing resources.

  2. Clear the cache much more often. We only compute cross-like diagrams and only the gamma-five operator, so we are already in a good situation, but it still does not fit. Clearing more often would allow us to run many more threads, hopefully outperforming the fewer-threads-with-cache variant.

  3. Use fewer and fewer threads until it fits. Shudder.

  4. Try some other parallelization direction, perhaps over the quantum numbers? We have around 250k different correlators to compute, so we ought to be able to parallelize over them and keep just one cache. That of course will take time and delay production even further.

  5. Start with some subset, like discarding the higher moving frames or something.

kostrzewa commented 4 years ago

Try the large memory nodes and see what happens to begin with.

kostrzewa commented 4 years ago

Wasting resources is not really a problem if the calculation fits into our compute time budget.

martin-ueding commented 4 years ago

Nope, 180 GB of memory are not enough with four threads either 🤷‍♂️. I have just implemented the atomics change, but that will not save much. We have a result variable for every correlator and every time slice. Before, we also had another one per thread, so basically it was this:

sizeof(Complex) * num_correlators * (extent_time + num_threads)

With 96 time slices and four threads the change is small, and that is only part of the memory usage. With about 250k correlators that is 4 MB per time slice or thread. Hmm, so I have only saved a couple of MB with that change, I guess.

martin-ueding commented 4 years ago

Besides that, I looked into it after it had already run for about 40 minutes, and it was still on the first time slice combination. Start and end are 14:24:49 and 15:33:58, so it ran for over an hour before filling up the memory. But no thread even finished its first time slice combination out of 528. So, optimistically assuming that it only just failed to finish for lack of a few GB of memory, we would have an hour for 4 block pairs. That would be a fun 132 hours (5.5 days) for the whole configuration. Completely infeasible.

But we are using just 4 of the 48 cores that the machine has. So if we managed to use all of them without additional overhead, it would be ×12 faster. JUWELS does not have a walltime limit, but I guess we have the usual 24h with quota and the 6h without quota? So we need to achieve ×5.5 to fit into the walltime, and we potentially have ×12 in resources available. Perhaps we can clear the caches more often; then it might fit, and the recomputation might not kill us.

martin-ueding commented 4 years ago

I have now added a little change which will clear the QQ cache right after each trace. I have the feeling that this will not really change anything, because everything gets reused and therefore the stuff that is produced in each trace for a given time slice combination should just be reused completely. The second trace should not add anything to it.

I have submitted that on a large memory node on JUWELS, but I expect it to crash. And even if it helps, it still will not allow us to fit 48 threads in there. We need a different parallelization strategy; this one just does not scale.

kostrzewa commented 4 years ago

JUWELS does not have a walltime limit, but I guess we have the usual 24h with quota and the 6h without quota?

Yes, the time limits are applied via QOS at the submit stage.

kostrzewa commented 4 years ago

Nope, 180 GB of memory are not enough with four threads either 🤷‍♂️

Perhaps you could try to run on QBIG just to see the actual memory load as a test (using the devel QOS).

The other thing that might be worth doing is to actually fix the memory load output that Markus started; that way you will be able to predict exactly how much memory you require.

I have now added a little change which will clear the QQ cache right after each trace. I have the feeling that this will not really change anything, because everything gets reused and therefore the stuff that is produced in each trace for a given time slice combination should just be reused completely. The second trace should not add anything to it.

and the relevant part of the cache is cleared anyway once a particular time slice pair has been computed, correct?

kostrzewa commented 4 years ago

There are relatively straightforward memory savings that one can obtain, although only for small numbers of concurrent threads.

As the correlators are computed cumulatively on paired time slices, one might load perambulator columns and VdaggerV on given time slices on the fly only when required and then discard them. (This needs some synchronisation between threads, as there will be common elements, consider the pairs (0,0), (0,1), (0,2), (0,3), for example, when running with four threads).

Even with 48 threads, this part of the memory load would be reduced markedly as one would have only a subset of the VdaggerV and perambulator data in memory at any one time.

Of course, this will mean some overhead, although the I/O time for perams and VdaggerV is quite acceptable and there would be at most a couple of hundred read operations (rather than one big one).

It does mean, however, that functions like create_operators and contract would have to be modified substantially.

kostrzewa commented 4 years ago

Or rather, VdaggerV and perambulator submatrices corresponding to sets of blocks would need to be read on the fly.

kostrzewa commented 4 years ago

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

kostrzewa commented 4 years ago

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

In fact, how long are the lists of correlator requests on average? These should be huge, right?

kostrzewa commented 4 years ago

At the level of diagrams, one could use nested parallelism, threading over the correlator requests.

In fact, how long are the lists of correlator requests on average? These should be huge, right?

I think this might solve all of our problems, it might even lead to better load balancing. But only if these lists are actually as long as I naively assume that they might be.

In practice, for systems with many momenta, one would use few threads over block pairs but many threads over correlator requests, while for systems with few momenta or a single momentum, one would thread over the block pairs only (as done in the past).