cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0
326 stars, 92 forks

Segfault when running test on WSL 1.0 #221

Closed (hecmay closed 4 years ago)

hecmay commented 4 years ago

Environment: WSL 1.0 (Ubuntu 18.04 LTS) running on Windows 10. Many test cases fail with a segfault, including the tests for memory customization primitives and data packing primitives. The error message is as follows:

tests/test_schedule_memory.py Fatal Python error: Segmentation fault

Current thread 0x00007f1a47c90740 (most recent call first):
  File "/root/.local/lib/python3.7/site-packages/heterocl-0.1-py3.7.egg/heterocl/tvm/_ffi/_ctypes/function.py", line 183 in __call__
  File "/root/.local/lib/python3.7/site-packages/heterocl-0.1-py3.7.egg/heterocl/tvm/_ffi/function.py", line 280 in my_api_func
  File "/root/.local/lib/python3.7/site-packages/heterocl-0.1-py3.7.egg/heterocl/tvm/build_module.py", line 351 in lower
  File "/root/.local/lib/python3.7/site-packages/heterocl-0.1-py3.7.egg/heterocl/tvm/build_module.py", line 560 in build
  File "/root/.local/lib/python3.7/site-packages/heterocl-0.1-py3.7.egg/heterocl/api.py", line 318 in build
  File "/root/heterocl/tests/test_schedule_memory.py", line 10 in test_reuse_blur_x
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/python.py", line 1477 in runtest
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 244 in from_call
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 217 in call_runtest_hook
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/main.py", line 247 in _main
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 87 in <lambda>
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/_pytest/config/__init__.py", line 125 in main
  File "/root/anaconda3/envs/test/lib/python3.7/site-packages/pytest/__main__.py", line 7 in <module>
  File "/root/anaconda3/envs/test/lib/python3.7/runpy.py", line 85 in _run_code
  File "/root/anaconda3/envs/test/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Segmentation fault (core dumped)
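
The `Fatal Python error: Segmentation fault` header and the `Current thread ... (most recent call first)` listing above are in the format printed by Python's standard `faulthandler` module, which recent pytest versions enable by default. A minimal sketch for getting the same kind of traceback when a failing test is run as a plain script outside pytest; the helper name `enable_crash_traceback` is made up for illustration and is not part of HeteroCL:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Install faulthandler so a fatal signal such as SIGSEGV prints the Python stack.

    `enable_crash_traceback` is a hypothetical helper name, not a HeteroCL API.
    """
    if not faulthandler.is_enabled():
        # all_threads=True dumps every thread, not only the one that crashed.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    print("faulthandler enabled:", enable_crash_traceback())
```

Running the interpreter with `python -X faulthandler` has the same effect without touching the script.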

hecmay commented 4 years ago

The test cases pass on macOS Catalina (my local laptop) and on Ubuntu 18.04 LTS (CircleCI servers). I will check whether other Linux distributions under WSL have the same problem.

hecmay commented 4 years ago

As discussed with Hongzheng, the issue is caused by the GenerateReuseBuffer IR pass. The Allocate statement in the Reuse node was somehow optimized away, so the pass received a null pointer to it and segfaulted.
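
The failure mode described here (a pass that expects an Allocate statement under a Reuse node and dereferences a null pointer when it is missing) can be illustrated with a toy model. This is only a sketch in Python: the real pass is part of the compiled backend, and the `Node`, `find_first`, and `allocate_of` names below are hypothetical and do not correspond to HeteroCL's actual IR classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy IR node (hypothetical); `kind` is e.g. "Reuse", "Allocate", "For"."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def find_first(node: Node, kind: str) -> Optional[Node]:
    """Depth-first search for the first descendant of the given kind."""
    for child in node.children:
        if child.kind == kind:
            return child
        found = find_first(child, kind)
        if found is not None:
            return found
    return None


def allocate_of(reuse: Node) -> Node:
    """Return the Allocate node under a Reuse node, or fail with a clear error."""
    alloc = find_first(reuse, "Allocate")
    if alloc is None:
        # Checking here turns a would-be null dereference into a diagnosable error.
        raise ValueError("Reuse node has no Allocate statement in its body")
    return alloc
```

A check of this kind does not fix the missing Allocate, but it would report the problem as an error message instead of a crash.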

hecmay commented 4 years ago

Root cause: the reuse buffer stage is not attached at the place indicated by attach_scope. As a result, there is no Allocate IR node in the Reuse IR node's body. The error originates in the op scheduling process.

The scheduling function could not find the child stage, most likely because of a buffer pointer mismatch. I have run into similar issues before and worked around them with a quick-and-dirty fix (matching buffers by name instead of by pointer), but I do not think that is a good way to solve this issue.
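
The difference between matching by pointer and matching by name can be sketched in a few lines. This is an illustration only, with made-up names (`Buffer`, `find_stage_by_identity`, `find_stage_by_name`); it is not the scheduling code, and Python object identity stands in for pointer equality.

```python
class Buffer:
    """Stand-in for a buffer handle (hypothetical); identity plays the role of a pointer."""

    def __init__(self, name):
        self.name = name


def find_stage_by_identity(stages, buf):
    """Return the stage whose buffer is the very same object as `buf`."""
    for stage_buf, stage in stages:
        if stage_buf is buf:
            return stage
    return None


def find_stage_by_name(stages, buf):
    """Return the first stage whose buffer has the same name as `buf`."""
    for stage_buf, stage in stages:
        if stage_buf.name == buf.name:
            return stage
    return None


if __name__ == "__main__":
    registered = Buffer("sum")
    stages = [(registered, "stage_sum")]
    lookalike = Buffer("sum")  # same name, different object
    print(find_stage_by_identity(stages, lookalike))
    print(find_stage_by_name(stages, lookalike))
```

The identity lookup misses the look-alike buffer while the name lookup finds it, which is why name matching works as a workaround. It also shows why the workaround is fragile: two distinct buffers that happen to share a name would be confused with each other.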

zhangzhiru commented 4 years ago

@seanlatias please take a look after Thursday. This seems to have become a blocking issue.

seanlatias commented 4 years ago

It still seems to be an OS-dependent issue. I will take a closer look.

hecmay commented 4 years ago

@seanlatias @zhangzhiru This issue only occurs on WSL Ubuntu (I am also trying other WSL Linux distributions, including WSL CentOS). We cannot reproduce it on other commercial or private servers.
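
Since the failure is specific to WSL, it can help to record in test logs whether a run happened under WSL. A minimal heuristic sketch, assuming the kernel release string reported under WSL contains "Microsoft" (in some capitalization); the function name `looks_like_wsl` and the sample release strings in the usage lines are illustrative, not taken from this issue.

```python
import platform


def looks_like_wsl(release=None):
    """Heuristic WSL check based on the kernel release string.

    Assumption: WSL kernels report a release containing "microsoft",
    in some capitalization. `looks_like_wsl` is a hypothetical helper name.
    """
    if release is None:
        release = platform.uname().release
    return "microsoft" in release.lower()


if __name__ == "__main__":
    # Sample strings are illustrative only.
    print(looks_like_wsl("4.4.0-18362-Microsoft"))
    print(looks_like_wsl("4.15.0-72-generic"))
    print("this machine:", looks_like_wsl())
```

A condition like this could also feed `pytest.mark.skipif` to keep the known-bad tests from blocking a WSL run, although that only hides the symptom.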

hecmay commented 4 years ago

This is basically the same issue that Hongzheng mentioned in #222. The problem is that, as you can see, the WB stage is not attached properly in the IR:

// attr [extern(WB, 0x7fffdb49f6e0)] realize_scope = ""
realize WB() {
  // attr [[buffer(A, 0x7fffdb285480), Tensor(shape=[6, 6], op.name=A)]] buffer_bind_scope = tvm_tuple(0, 6, 0, 6)
  // attr [[buffer(WB, 0x7fffdb125ac0), Tensor(shape=[], op.name=WB)]] buffer_bind_scope = tvm_tuple()
  produce WB {
    // attr [0] extern_scope = 0
    0
  }
  // attr [extern(_top, 0x7fffdb490a30)] realize_scope = ""
  realize _top() {
    // attr [[buffer(F, 0x7fffdaf39240), Tensor(shape=[3, 3], op.name=F)]] buffer_bind_scope = tvm_tuple(0, 3, 0, 3)
    // attr [[buffer(A, 0x7fffdb285480), Tensor(shape=[6, 6], op.name=A)]] buffer_bind_scope = tvm_tuple(0, 6, 0, 6)
    // attr [[buffer(_top, 0x7fffdb496230), Tensor(shape=[], op.name=_top)]] buffer_bind_scope = tvm_tuple()
    produce _top {
      // attr [0] extern_scope = 0
      // attr [extern(B, 0x7fffdb484090)] realize_scope = ""
      realize B([0, 4], [0, 4]) {
        // attr [[buffer(A, 0x7fffdb285480), Tensor(shape=[6, 6], op.name=A)]] buffer_bind_scope = tvm_tuple(0[110/1983]
        // attr [[buffer(B, 0x7fffdb465ca0), Tensor(shape=[4, 4], op.name=B)]] buffer_bind_scope = tvm_tuple(0, 4, 0, 4)
        produce B {
          // attr [0] extern_scope = 0
          for (y, 0, 4) {
            for (x, 0, 4) {
              reuse A
              // attr [              for (y, 0, 4) {
                // attr [iter_var(y, Range(min=0, extent=4))] loop_scope = y
                for (x, 0, 4) {
                  // attr [iter_var(x, Range(min=0, extent=4))] loop_scope = x
                  // attr [buffer(sum, 0x7fffdb480620)] attach_scope = "B"
                  for (ra2, 0, 3) {
                    // attr [iter_var(ra2, Range(min=0, extent=3))] loop_scope = ra2
                    for (ra3, 0, 3) {
                      // attr [iter_var(ra3, Range(min=0, extent=3))] loop_scope = ra3
                      if (1) {
                        sum[0] = int32((int65((int64(A[((x + ra3) + ((y + ra2)*6))])*int64(F[(ra3 + (ra2*3))]))) + int65(sum[0])))
                      }
                    }
                  }
                  B[(x + (y*4))] = int32(sum[0])
                }
              }
] attach_scope = "B"
              // attr [extern(sum, 0x7fffdb46f740)] realize_scope = ""
              realize sum([0, 1]) {
                // attr [[buffer(sum, 0x7fffdb480620), Tensor(shape=[1], op.name=sum)]] buffer_bind_scope = tvm_tuple(0, 1)
                produce sum {
                  // attr [0] extern_scope = 0
                  for (x, 0, 1) {
                    sum[x] = int32(0)
                  }
                }
                for (ra2, 0, 3) {
                  for (ra3, 0, 3) {
                    if (1) {
                      sum[0] = int32((int65((int64(A[((x + ra3) + ((y + ra2)*6))])*int64(F[(ra3 + (ra2*3))]))) + int65(sum[0])))
                    }
                  }
                }
                B[(x + (y*4))] = int32(sum[0])
              }
            }
          }
        }
      }
    }
  }
}

seanlatias commented 4 years ago

Solved by #226.