Mysterious -nan popping up in kernel

dionhaefner commented 7 years ago

The second constant 0.5 in the code

def calculate_velocity_on_wgrid(pyom):
    pyom.u_wgrid[:,:,:-1] = pyom.u[:,:,1:,pyom.tau] * pyom.maskU[:,:,1:] * 0.5 * pyom.dzt[None,None,1:] / pyom.dzw[None,None,:-1] \
                          + pyom.u[:,:,:-1,pyom.tau] * pyom.maskU[:,:,:-1] * 0.5 * pyom.dzt[None,None,:-1] / pyom.dzw[None,None,:-1]

is translated to a -nan in the JIT kernel:

rank: 0, size: 26416, block list:
    rank: 1, size: 49, news: {a49,a190,}, frees: {a49,}, temps: {a49,}, block list:
            BH_MULTIPLY a49[0:26416:49,0:49:1] a112[0:26416:150,1:50:3,1:2:1] a181[0:26416:50,1:50:1]
            BH_MULTIPLY a190[0:26416:49,0:49:1] a49[0:26416:49,0:49:1] 0.5
            BH_FREE a49[0:26416:49,0:49:1]
Write source "/tmp/bohrium_effe/src/d27251f667cc3cdf.c"
rank: 0, size: 26416, block list:
    rank: 1, size: 49, news: {a136,}, frees: {a190,}, block list:
            BH_MULTIPLY a136[0:26416:49,0:49:1] a190[0:26416:49,0:49:1] a154[0:26416:0,1:50:1]
            BH_FREE a190[0:26416:49,0:49:1]
Write source "/tmp/bohrium_effe/src/6a26971750a31219.c"
rank: 0, size: 26416, block list:
    rank: 1, size: 49, news: {a66,a194,a132,a51,}, frees: {a136,a66,a132,a51,}, temps: {a66,a132,a51,}, block list:
            BH_MULTIPLY a132[0:26416:49,0:49:1] a112[0:26416:150,0:49:3,1:2:1] a181[0:26416:50,0:49:1]
            BH_MULTIPLY a51[0:26416:49,0:49:1] a132[0:26416:49,0:49:1] -nan
            BH_FREE a132[0:26416:49,0:49:1]
            BH_DIVIDE a66[0:26416:49,0:49:1] a136[0:26416:49,0:49:1] a53[0:26416:0,0:49:1]
            BH_ADD a194[0:26416:49,0:49:1] a66[0:26416:49,0:49:1] a51[0:26416:49,0:49:1]
            BH_FREE a51[0:26416:49,0:49:1]
            BH_FREE a66[0:26416:49,0:49:1]
            BH_FREE a136[0:26416:49,0:49:1]
Write source "/tmp/bohrium_effe/src/7bfb3319dabe4166.c"
/tmp/bohrium_effe/src/7bfb3319dabe4166.c: In function ‘execute’:
/tmp/bohrium_effe/src/7bfb3319dabe4166.c:18:27: error: wrong type argument to unary minus
                 t3 = t0 * -nan;
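
A note on the compiler message: `-nan` is not a floating-point literal in C. `<math.h>` declares `nan` as a function, so the compiler most likely sees a unary minus applied to a function name, which would explain "wrong type argument to unary minus". The sketch below shows one way a code generator could emit constants that stay valid C even when they are non-finite. The helper name `c_double_literal` is hypothetical and not part of Bohrium, and this only covers how the constant is printed, not why 0.5 turned into NaN in the first place.

```python
import math


def c_double_literal(value: float) -> str:
    """Return C source text for a double constant (hypothetical helper)."""
    if math.isnan(value):
        # NAN is a macro from <math.h>; the text "nan" or "-nan" is not a literal.
        return "NAN"
    if math.isinf(value):
        # INFINITY is also a <math.h> macro.
        return "INFINITY" if value > 0 else "-INFINITY"
    # repr() of a finite Python float is also a valid C double literal.
    return repr(float(value))


print(c_double_literal(0.5))
print(c_double_literal(float("nan")))
```

With such a guard the kernel would compile, but the result would still be NaN, so it would only hide the symptom.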

Unfortunately, I have not been able to reproduce the bug in any other setting. Everything works when I remove any one of the arrays, or when I run the code in isolation. All arrays contain only finite values and have dtype float64, and dzw does not contain any zeros.
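
For reference, the input checks described above can be sketched roughly like this in plain NumPy. The helper `check_inputs` and the stand-in arrays are hypothetical; in the real model the arrays would be the `pyom` attributes.

```python
import numpy as np


def check_inputs(**arrays):
    """Return the names of arrays that are not float64 or hold non-finite values."""
    bad = []
    for name, arr in arrays.items():
        arr = np.asarray(arr)
        if arr.dtype != np.float64 or not np.all(np.isfinite(arr)):
            bad.append(name)
    return bad


# Stand-in data (assumption): the actual values come from the model setup.
dzt = np.ones(50)
dzw = 1 + np.random.rand(50)

print(check_inputs(dzt=dzt, dzw=dzw))  # names of offending arrays, if any
print(np.any(dzw == 0))                # True would mean a division by zero is possible
```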

I figured you guys might have an idea what could cause this. Otherwise, I'll have to dig deeper to try and reproduce the problem.

madsbk commented 7 years ago

Could you attach the kernel source, which is /tmp/bohrium_effe/src/7bfb3319dabe4166.c in this case?

Also, try inserting a flush before and after the code:

def calculate_velocity_on_wgrid(pyom):
    np.flush()
    pyom.u_wgrid[:,:,:-1] = pyom.u[:,:,1:,pyom.tau] * pyom.maskU[:,:,1:] * 0.5 * pyom.dzt[None,None,1:] / pyom.dzw[None,None,:-1] \
                          + pyom.u[:,:,:-1,pyom.tau] * pyom.maskU[:,:,:-1] * 0.5 * pyom.dzt[None,None,:-1] / pyom.dzw[None,None,:-1]
    np.flush()

dionhaefner commented 7 years ago

Here's the kernel:

kernel.txt

Flushing makes no difference.

madsbk commented 7 years ago

Is this fixed by #229?

dionhaefner commented 7 years ago

I'll test tomorrow (I use the nightly PPA).

dionhaefner commented 7 years ago

It seems to be fixed on at least one of my setups, thanks! If the problem comes up again, I'll reopen.

dionhaefner commented 7 years ago

Sorry, I spoke too soon. The code no longer crashes, but the arrays now contain nan, although every value should be finite.

dionhaefner commented 7 years ago

Alright, I was able to boil it down to this:

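# Note: np.flush() is a Bohrium call; plain NumPy has no flush(), so this script
# presumably has to run with Bohrium standing in for NumPy.
import numpy as np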

class PyOM(object):
    def __init__(self):
        self.nx = 100
        self.ny = 250
        self.nz = 50
        self.tau = 1
        self.dzt = np.zeros(self.nz)
        self.dzw = np.zeros(self.nz)
        self.maskU = np.zeros((self.nx+4, self.ny+4, self.nz))
        self.u = np.zeros((self.nx+4, self.ny+4, self.nz, 3))
        self.u_wgrid = np.zeros((self.nx+4, self.ny+4, self.nz))
        self.v_wgrid = np.zeros((self.nx+4, self.ny+4, self.nz))
        self.dzw = 1 + np.random.rand(self.nz)

def calculate_velocity_on_wgrid(pyom):
    np.flush()
    pyom.u_wgrid[:,:,:-1] = pyom.u[:,:,1:,pyom.tau] * pyom.maskU[:,:,1:] * 0.5 * pyom.dzt[None,None,1:] / pyom.dzw[None,None,:-1] \
                           + pyom.u[:,:,:-1,pyom.tau] * pyom.maskU[:,:,:-1] * 0.5 * pyom.dzt[None,None,:-1] / pyom.dzw[None,None,:-1]
    np.flush()

if __name__ == "__main__":
    pyom = PyOM()
    calculate_velocity_on_wgrid(pyom)
    print(np.any(np.isnan(pyom.u_wgrid)))

It works as soon as I remove the flush or any one of the allocations (like v_wgrid, which isn't even used in the code).
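
As a baseline, here is a sketch of the same expression evaluated with plain NumPy (no Bohrium, no flush) on inputs shaped like the ones in the script above. The function name `reference_u_wgrid` is hypothetical. With all-zero `u`, `maskU` and `dzt` and a strictly positive `dzw`, every term is 0 divided by a nonzero number, so no NaN should appear.

```python
import numpy as np


def reference_u_wgrid(u, maskU, dzt, dzw, tau):
    """Evaluate the u_wgrid expression with plain NumPy (hypothetical helper)."""
    out = np.zeros_like(maskU)
    out[:, :, :-1] = (
        u[:, :, 1:, tau] * maskU[:, :, 1:] * 0.5
        * dzt[None, None, 1:] / dzw[None, None, :-1]
        + u[:, :, :-1, tau] * maskU[:, :, :-1] * 0.5
        * dzt[None, None, :-1] / dzw[None, None, :-1]
    )
    return out


# Same shapes as in the script above: (nx+4, ny+4, nz) with nx=100, ny=250, nz=50.
nx, ny, nz = 104, 254, 50
u = np.zeros((nx, ny, nz, 3))
maskU = np.zeros((nx, ny, nz))
dzt = np.zeros(nz)
dzw = 1 + np.random.rand(nz)  # values in [1, 2), so never zero

result = reference_u_wgrid(u, maskU, dzt, dzw, tau=1)
print(np.any(np.isnan(result)))
```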

bh107 / bohrium

Mysterious -nan popping up in kernel #220