HISKP-LQCD / sLapH-contractions

Stochastic LapH contraction program
GNU General Public License v3.0

Issues with cA2.09.48 on JUWELS #109

Closed. martin-ueding closed this issue 4 years ago.

martin-ueding commented 4 years ago

For lack of a better place I am putting this here. I am trying to run the contractions for cA2.09.48 on JUWELS. I got the program compiled and I think that I have set it up correctly in the input file. The paths might be wrong, though, because the perambulators are not stored in one subdirectory per random vector. At the moment I get the following:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 19) >= this->size() (which is 19)
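For reference, this is the libstdc++ message from a bounds-checked `.at()` access where the requested index equals the container size. A minimal standalone reproduction (not the actual contraction code) of what presumably happens when the list of perambulator paths comes out one entry shorter than expected:

```cpp
// Minimal reproduction of the failure mode above: index 19 requested from a
// container that only holds 19 entries (indices 0..18). The variable name is
// made up for illustration and does not come from sLapH-contractions.
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> peram_paths(19);  // only 19 paths were found

  try {
    // Bounds-checked access; index 19 is one past the end.
    std::cout << peram_paths.at(19) << "\n";
  } catch (std::out_of_range const &e) {
    // libstdc++ reports: vector::_M_range_check: __n (which is 19) >= this->size() (which is 19)
    std::cerr << e.what() << "\n";
    return 1;
  }
}
```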
martin-ueding commented 4 years ago

I was thinking about parallelizing over the correlators as well. There are cheap diagrams and there are expensive ones, but of every kind there are lots of them:

{'C4cC': 4709, 'C4cD': 4709, 'C6cC': 81391, 'C6cCD': 124021, 'C6cD': 47833}

I would like this parallelization direction because it would mean that memory consumption stays constant with the number of threads. The cache would need to be synchronized, but with omp critical that should not be too hard.
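Roughly what I have in mind, as a sketch only; the type and function names here (CorrelatorRequest, QuarkLine, build_quark_line, contract_one) are placeholders and not the actual types in the code:

```cpp
// Sketch: distribute the correlator requests over OpenMP threads and guard a
// shared cache of intermediate objects with an `omp critical` section.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct CorrelatorRequest {
  std::string quark_line_key;  // which cached intermediate this diagram needs
};

struct QuarkLine {};  // stand-in for an expensive intermediate object

QuarkLine build_quark_line(std::string const &) { return {}; }      // stub
void contract_one(CorrelatorRequest const &, QuarkLine const &) {}  // stub

void contract_all(std::vector<CorrelatorRequest> const &requests) {
  // One cache shared by all threads, so the memory footprint does not grow
  // with the number of threads.
  std::map<std::string, QuarkLine> cache;

#pragma omp parallel for schedule(dynamic)
  for (std::size_t i = 0; i < requests.size(); ++i) {
    QuarkLine const *line = nullptr;

    // Lookups and insertions into the shared cache are serialized. std::map
    // does not invalidate references to existing elements on insertion, so
    // the pointer stays valid outside the critical section.
#pragma omp critical(quark_line_cache)
    {
      auto it = cache.find(requests[i].quark_line_key);
      if (it == cache.end())
        it = cache
                 .emplace(requests[i].quark_line_key,
                          build_quark_line(requests[i].quark_line_key))
                 .first;
      line = &it->second;
    }

    // The actual contraction runs fully in parallel.
    contract_one(requests[i], *line);
  }
}

int main() {
  std::vector<CorrelatorRequest> requests = {{"Q2Q0_a"}, {"Q2Q0_a"}, {"Q2Q0_b"}};
  contract_all(requests);
}
```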

martin-ueding commented 4 years ago

Shoot, I just realized that there is no C2c included, so we will not get the pion out of this. Do we just want zero momentum there, or do we want to verify the dispersion relation? Either way I will need to add a couple of these manually, because my projection code does not cover the single-particle case yet.
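To make concrete what verifying the dispersion relation would mean: we would compare the fitted single-pion energies at the first few lattice momenta against aE(n) = sqrt((aM_pi)^2 + (2*pi/L)^2 * n^2) with L = 48. A small side calculation, where aM_pi = 0.062 is just an assumed ballpark value for illustration, not a measured number:

```cpp
// Side calculation, not part of the contraction code: expected pion energies
// for the first few lattice momenta using the continuum dispersion relation.
#include <cmath>
#include <cstdio>

int main() {
  double const L = 48.0;
  double const aM_pi = 0.062;  // assumed pion mass in lattice units
  double const pi = std::acos(-1.0);
  double const p_unit = 2.0 * pi / L;

  for (int nsq = 0; nsq <= 4; ++nsq) {
    double const aE = std::sqrt(aM_pi * aM_pi + p_unit * p_unit * nsq);
    std::printf("n^2 = %d: aE = %.4f\n", nsq, aE);
  }
}
```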

martin-ueding commented 4 years ago

I ran the version with the Q2Q0 optimization with a single thread and a single time slice combination on JUWELS. So far only the job in the devel QOS has run. It exhausted the 2 hours of walltime and was killed, so I do not have memory or performance readings. Another job is waiting in the batch queue, but it has not started running yet.

It was built to keep the caches, and the memory load did not exceed the 90 GB within the two hours. That does not necessarily mean that it would not do so later on, but at least there is some hope for the 24 hour jobs in the batch partition.

We can get a speed-up of up to 48 on JUWELS, but there are 528 time slice combinations to run. With at least the two hours per combination seen on the devel QOS, that gives a lower bound of 528 × 2 h / 48 ≈ 22 hours per configuration. Fortunately we can run the time slice combinations individually; it is just the IO that we would have to re-do every single time. We could also tweak the program such that it does not load everything but only what it needs. This way we can split the work into more but smaller jobs, at the cost of having to merge and accumulate the HDF5 files that come out.
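The merge step could be as simple as the following sketch. It assumes that each partial file exposes a correlator as a flat one-dimensional double dataset and that contributions from different time slice blocks simply add up; the program's actual HDF5 layout may well differ, so this is an illustration rather than a ready-made tool:

```cpp
// Sketch: sum one named dataset over several partial HDF5 files and write the
// accumulated result to a new file. A flat dataset name (no nested groups) is
// assumed; any normalization would have to follow the program's conventions.
#include <H5Cpp.h>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char **argv) {
  if (argc < 4) {
    std::cerr << "usage: " << argv[0] << " dataset out.h5 part1.h5 [part2.h5 ...]\n";
    return 1;
  }
  std::string const dataset_name = argv[1];
  std::string const out_name = argv[2];

  std::vector<double> accumulated;

  for (int i = 3; i < argc; ++i) {
    H5::H5File file(argv[i], H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet(dataset_name);
    auto const n = dataset.getSpace().getSimpleExtentNpoints();

    std::vector<double> part(static_cast<std::size_t>(n));
    dataset.read(part.data(), H5::PredType::NATIVE_DOUBLE);

    if (accumulated.empty()) {
      accumulated.assign(part.size(), 0.0);
    } else if (part.size() != accumulated.size()) {
      std::cerr << "dataset size mismatch in " << argv[i] << "\n";
      return 1;
    }
    for (std::size_t j = 0; j < part.size(); ++j)
      accumulated[j] += part[j];
  }

  // Write the accumulated sum to the output file.
  hsize_t dims[1] = {accumulated.size()};
  H5::DataSpace space(1, dims);
  H5::H5File out(out_name, H5F_ACC_TRUNC);
  H5::DataSet out_set =
      out.createDataSet(dataset_name, H5::PredType::NATIVE_DOUBLE, space);
  out_set.write(accumulated.data(), H5::PredType::NATIVE_DOUBLE);
}
```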

I will watch the jobs on JUWELS and at the same time proceed with Issue #111 to realize that speedup.

martin-ueding commented 4 years ago

I'll close this ticket because running on JUWELS now works in principle; it is just too slow, and we have #111 for that issue.