daschaich / susy

Codes for supersymmetric lattice gauge theories
GNU General Public License v3.0

Bug Report: Division by Zero #16

Closed westwind2013 closed 6 years ago

westwind2013 commented 6 years ago

Hi there,

When I run the program, it encounters a "Division by Zero" error. The error-inducing inputs are:

nx: 3 ny: 1 nz: 1 nt: 3 PBC: 5 iseed: 1 Nroot: 3 Norder: 1

The error is in the function setup_layout() in susy/4d_Q16/generic/layout_hyper_prime.c. I fixed it by checking whether any divisor is 0, which is achieved by inserting the following code before the division operation, i.e., before line 115.

if (squaresize[XUP] == 0 ||
    squaresize[YUP] == 0 ||
    squaresize[ZUP] == 0 ||
    squaresize[TUP] == 0)
  { Abort(); }

Thank you!

daschaich commented 6 years ago

Thanks very much for the report! I am investigating, but am not yet able to reproduce the problem. With the inputs you provide, my tests correctly produce non-zero squaresize elements and abort on line 130 of susy/4d_Q16/generic/layout_hyper_prime.c, which checks that each node has an even number of sites (a limitation that I will mention more prominently in the README...).

I suspect the problem is related to the compiler I asked about in your other report. These squaresize elements are initialized to {nx, ny, nz, nt} and then divided by their prime factors, so they should never become zero, unless your compiler is doing something strange with integer division. I plan to add the check you suggest to the end of the setup_hyper_prime() routine on line 100---again maybe not until around 15 January.

westwind2013 commented 6 years ago

Thank you for your efforts to test it. My bad, I forgot to mention that the triggering condition is "mpirun -n 4 ./susy_hmc". (Running with a single process does abort as you mentioned earlier.)

The compiler is gcc-4.8, and the MPI implementation is mpich-2.1.

westwind2013 commented 6 years ago

Here is the error message the program emits:

westwind@VirtualBox:~/mpibench/susy-master/4d_Q16/susy$ mpirun -n 4 ./susy_hmc
N=4 SYM, Nc = 2, DIMF = 4, fermion rep = adjoint
Microcanonical simulation with refreshing
Machine = MPI (portable), with 4 nodes
Hybrid Monte Carlo algorithm
Phi algorithm
start: Fri Jan 5 16:57:23 2018

type 0 for no prompts or 1 for prompts
1
enter nx 3
nx 3
enter ny 1
ny 1
enter nz 1
nz 1
enter nt 3
nt 3
enter PBC 5
PBC 5
enter iseed 1
iseed 1
enter Nroot 3
Nroot 3
enter Norder 1
Norder 1
WARNING: Running with reduced dim(s) but didn't compile with -DDIMREDUCE
LAYOUT = Hypercubes, options = hyper_prime,

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 136
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Not sure why the font of the error message gets so big :-O

daschaich commented 6 years ago

Ah, excellent, I've reproduced the problem with both -np 2 and -np 4. I had previously tested only -np 3 and serial running, since the code tries to divide the (3x3) lattice evenly amongst all processes.

Commit a1cda29 should now provide a clean exit when this problem is encountered, so I'll close the issue. While I would like to track down how exactly the problem arises in the setup_hyper_prime() routine, this seems like enough of a corner case that at least for the time being I'm content just to punt and terminate.

westwind2013 commented 6 years ago

Great!