apache / tvm

Open deep learning compiler stack for CPU, GPU, and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] Segmentation fault (core dumped) in executing the inference #17311

Open Cookiee235 opened 2 months ago

Cookiee235 commented 2 months ago

Actual behavior

Segmentation fault (core dumped)

Steps to reproduce

import tvm
from tvm import relax
import numpy as np
from tvm.script import ir as I
from tvm.script import tir as T
from tvm.script import relax as R

@I.ir_module
class Module:
    @T.prim_func(private=True)
    def layer_norm(A: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32"), gamma: T.Buffer((T.int64(64), T.int64(64)), "float32"), beta: T.Buffer((T.int64(64), T.int64(64)), "float32"), T_layer_norm: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32")):
        T.func_attr({"op_pattern": 4})
        # with T.block("root"):
        rxplaceholder_red_temp_v0 = T.alloc_buffer((T.int64(64), T.int64(64)))
        rxplaceholder_red_temp_v1 = T.alloc_buffer((T.int64(64), T.int64(64)))
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("rxplaceholder_red_temp"):
                ax0, ax1, k2, k3 = T.axis.remap("SSRR", [i0, i1, i2, i3])
                T.reads(A[ax0, ax1, k2, k3])
                T.writes(rxplaceholder_red_temp_v0[ax0, ax1], rxplaceholder_red_temp_v1[ax0, ax1])
                with T.init():
                    rxplaceholder_red_temp_v0[ax0, ax1] = T.float32(0)
                    rxplaceholder_red_temp_v1[ax0, ax1] = T.float32(0)
                v_rxplaceholder_red_temp_v0: T.float32 = rxplaceholder_red_temp_v0[ax0, ax1] + A[ax0, ax1, k2, k3]
                v_rxplaceholder_red_temp_v1: T.float32 = rxplaceholder_red_temp_v1[ax0, ax1] + A[ax0, ax1, k2, k3] * A[ax0, ax1, k2, k3]
                rxplaceholder_red_temp_v0[ax0, ax1] = v_rxplaceholder_red_temp_v0
                rxplaceholder_red_temp_v1[ax0, ax1] = v_rxplaceholder_red_temp_v1
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("T_layer_norm"):
                ax0, ax1, ax2, ax3 = T.axis.remap("SSSS", [i0, i1, i2, i3])
                T.reads(A[ax0, ax1, ax2, ax3], rxplaceholder_red_temp_v0[ax0, ax1], rxplaceholder_red_temp_v1[ax0, ax1], gamma[ax2, ax3], beta[ax2, ax3])
                T.writes(T_layer_norm[ax0, ax1, ax2, ax3])
                T_layer_norm[ax0, ax1, ax2, ax3] = (A[ax0, ax1, ax2, ax3] - rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003)) * T.rsqrt(rxplaceholder_red_temp_v1[ax0, ax1] * T.float32(0.050000000000000003) - rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003) * (rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003)) + T.float32(1.0000000000000001e-05)) * gamma[ax2, ax3] + beta[ax2, ax3]

    @T.prim_func(private=True)
    def relu(A: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32"), B: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32")):
        T.func_attr({"op_pattern": 0})
        # with T.block("root"):
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("relu"):
                v_i0, v_i1, v_i2, v_i3 = T.axis.remap("SSSS", [i0, i1, i2, i3])
                T.reads(A[v_i0, v_i1, v_i2, v_i3])
                T.writes(B[v_i0, v_i1, v_i2, v_i3])
                B[v_i0, v_i1, v_i2, v_i3] = T.max(A[v_i0, v_i1, v_i2, v_i3], T.float32(0))

    @R.function(private=True)
    def fused_layer_norm_relu(x: R.Tensor((1, 512, 64, 64), dtype="float32"), mean: R.Tensor((64, 64), dtype="float32"), var: R.Tensor((64, 64), dtype="float32")) -> R.Tensor((1, 512, 64, 64), dtype="float32"):
        R.func_attr({"Primitive": 1})
        cls = Module
        with R.dataflow():
            gv0 = R.call_tir(cls.layer_norm, (x, mean, var), out_sinfo=R.Tensor((1, 512, 64, 64)))
            gv = R.call_tir(cls.relu, (gv0,), out_sinfo=R.Tensor((1, 512, 64, 64), dtype="float32"))
            R.output(gv)
        return gv

    @R.function
    def main(x: R.Tensor((1, 512, 64, 64), dtype="float32"), mean: R.Tensor((64, 64), dtype="float32"), var: R.Tensor((64, 64), dtype="float32")) -> R.Tensor((1, 512, 64, 64), dtype="float32"):
        cls = Module
        with R.dataflow():
            gv: R.Tensor((1, 512, 64, 64), dtype="float32") = cls.fused_layer_norm_relu(x, mean, var)
            R.output(gv)
        return gv

mod = Module
mod = relax.transform.FuseTIR()(mod)

def compile_mod(mod, func_name, target, *inputs):
    ex = relax.build(mod, target=target)
    vm = relax.VirtualMachine(ex, tvm.cpu())
    mod_outputs = vm[func_name](*inputs)  # segfault here

input_0 = tvm.nd.array(10 * np.random.random([1, 512, 64, 64]).astype('float32'))
input_1 = tvm.nd.array(10 * np.random.random([64, 64]).astype('float32'))
input_2 = tvm.nd.array(10 * np.random.random([64, 64]).astype('float32'))
compile_mod(mod, 'main', 'llvm', input_0, input_1, input_2)

CC @Lunderberg @vinx13

Lunderberg commented 2 months ago

Looks like a bug in your layer_norm implementation. The rxplaceholder_red_temp_v0 and rxplaceholder_red_temp_v1 both have shape [64,64], but are being accessed at indices [0:1, 0:512]. Are these buffers intended to have shape [1,512]?
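
For concreteness, a minimal sketch of the fix this suggests (an assumed change, not a confirmed patch): the temporaries are written at [ax0, ax1], whose extents are (1, 512), so the allocations in layer_norm would become:

        # Assumed fix: size the reduction temporaries by the extents of the axes
        # that index them, (1, 512), instead of (64, 64).
        rxplaceholder_red_temp_v0 = T.alloc_buffer((T.int64(1), T.int64(512)))
        rxplaceholder_red_temp_v1 = T.alloc_buffer((T.int64(1), T.int64(512)))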

Cookiee235 commented 2 months ago

@Lunderberg Indeed, the indices are out of bounds. However, a segmentation fault is dangerous behavior and is often considered a vulnerability. Do we need a mechanism to check index validity at the TIR level?

Lunderberg commented 2 months ago

True. There is a very old mechanism that uses the "tir.instrument_bound_checkers" config option to add buffer bounds checks, but its code path is only used for TIR functions produced from TE schedules. The annotations it provides haven't been useful since #9727 a few years ago, as all BufferLoad and BufferStore instances have sufficient information to do bounds-checking anyway. There's a newer tir::transform::OOBChecker() that does better bounds checking, introduced in #12352, but it doesn't seem to be used anywhere.
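
For reference, a sketch of how that older option is set, assuming the standard pass-config interface; as described above, it likely would not help with the hand-written TVMScript module in this issue:

# Enable the legacy bounds-check instrumentation via the PassContext config.
# Per the comment above, its code path only covers TIR produced from TE
# schedules, so it is not expected to catch the out-of-bounds accesses here.
with tvm.transform.PassContext(config={"tir.instrument_bound_checkers": True}):
    ex = relax.build(mod, target="llvm")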

I like the idea of having this check applied by default. Placing it either at the beginning or end of the TIR lowering pipeline would have caught this error during compilation.

Cookiee235 commented 2 months ago

@Lunderberg Thanks! The tvm.tir.analysis.OOBChecker() is very useful. I like it! It successfully identifies the OOB issue and avoids the segfault, but it is disabled by default. Why not enable it by default?
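
For readers reproducing this, a minimal sketch of applying the checker to the module from the script above (assuming the OOBChecker pass from #12352; the exact error it raises may vary by TVM version):

# Continuing the reproduction script: run the out-of-bounds checker on the fused
# module before building, so the bad layer_norm accesses are reported during
# compilation instead of segfaulting at run time.
tvm.tir.analysis.OOBChecker()(mod)  # raises an error on the out-of-bounds accesses
ex = relax.build(mod, target="llvm")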

Lunderberg commented 2 months ago

Overall, there are no significant reasons not to. There may be some initial failures in cases where the buffer size is unknown and a shape of [1] is used as a fallback, but those should probably be fixed anyway.
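
As a hypothetical illustration of that failure mode (the function and shapes below are invented for illustration, not taken from the TVM codebase): a buffer whose true extent is unknown may be declared with a placeholder shape of (1,), so a checker enabled by default would flag otherwise-valid accesses to it:

from tvm.script import tir as T

# Hypothetical example: B's real size is unknown, so it is declared with the
# fallback shape (1,); a default-on bounds checker would flag B[i] for i > 0
# even though the underlying allocation may be large enough at run time.
@T.prim_func
def copy_with_fallback_shape(A: T.Buffer((256,), "float32"), b: T.handle):
    B = T.match_buffer(b, (1,), "float32")  # fallback shape; real extent unknown
    for i in range(256):
        B[i] = A[i]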