[WIP] Dev issue #766 - Githubissues

Hi @Garic152 , thanks for the PR.

The segfault happens in SelectMatrixRepresentationsPass.cpp. Here is a section of the call stack at the time of the segfault. See frames #3 and #12; the others are just LLVM/MLIR and std::function internals.

─── Stack ───
#0  0x00005555578eb99e in llvm::PointerIntPair<mlir::Type, 3u, mlir::detail::ValueImpl::Kind, llvm::PointerLikeTypeTraits<mlir::Type>, llvm::PointerIntPairInfo<mlir::Type, 3u, llvm::PointerLikeTypeTraits<mlir::Type> > >::setPointer(mlir::Type) & (this=0x28, PtrVal=...) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/PointerIntPair.h:65
#1  0x00005555578eb73d in mlir::detail::ValueImpl::setType (this=0x20, type=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:66
#2  0x00005555578eb76a in mlir::Value::setType (this=0x7fffffffabb0, newType=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:133
#3  0x0000555557ae4f63 in SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}::operator()(mlir::Operation) const (__closure=0x55555b0ebd98, op=0x55555b0c8a10) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:64
#4  0x0000555557ae6101 in std::_Function_handler<mlir::WalkResult (mlir::Operation*), SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}>::_M_invoke(std::_Any_data const&, mlir::Operation*&&) (__functor=..., __args#0=@0x7fffffffac60: 0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:285
#5  0x0000555557ae3a75 in std::function<mlir::WalkResult (mlir::Operation*)>::operator()(mlir::Operation*) const (this=0x55555b0ebd98, __args#0=0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:688
#6  0x0000555557ae3591 in llvm::function_ref<mlir::WalkResult (mlir::Operation*)>::callback_fn<std::function<mlir::WalkResult (mlir::Operation*)> >(long, mlir::Operation*) (callable=93825088273816, params#0=0x55555b0c8a10) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/STLFunctionalExtras.h:45
#7  0x000055555877338f in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#8  0x0000555558773337 in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#9  0x0000555557ae29e7 in mlir::detail::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::Operation*, mlir::WalkResult>(mlir::Operation*, std::function<mlir::WalkResult (mlir::Operation*)>&) (op=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Visitors.h:171
#10 0x0000555557ae2352 in mlir::Operation::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Operation.h:621
#11 0x0000555557ae1577 in mlir::OpState::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x7fffffffae08, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/OpDefinition.h:148
#12 0x0000555557ae5a0c in SelectMatrixRepresentationsPass::runOnOperation (this=0x55555b0ebc30) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:157

The problem seems to be that the before block of the WhileOp (the condition) has more arguments than the after block (the loop body).

Here is the IR of the beforeBlock:

>>> p beforeBlock.dump()
^bb0(%arg0: si64, %arg1: f64, %arg2: !daphne.Matrix<100x1xf64:sp[1.000000e+00]>):
  %17 = "daphne.ewGt"(%arg1, %4) : (f64, f64) -> f64
  %18 = "daphne.cast"(%17) : (f64) -> si64
  %19 = "daphne.ewLe"(%arg0, %6) : (si64, si64) -> si64
  %20 = "daphne.ewAnd"(%18, %19) : (si64, si64) -> si64
  %21 = "daphne.cast"(%20) : (si64) -> i1
  "scf.condition"(%21, %arg0, %arg2) : (i1, si64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()

Here is the IR of the afterBlock:

>>> p afterBlock.dump()
^bb0(%arg0: si64, %arg1: f64):
  %17 = "daphne.transpose"(%arg1) : (f64) -> !daphne.Matrix<1x100xf64:sp[1.000000e+00]>
  %18 = "daphne.ewMul"(%14, %17) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]:rep[sparse]>, !daphne.Matrix<1x100xf64:sp[1.000000e+00]>) -> !daphne.Matrix<100x100xf64:sp[1.990000e-02]>
  %19 = "daphne.maxRow"(%18) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]>) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %20 = "daphne.ewMax"(%19, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %21 = "daphne.ewNeq"(%20, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %22 = "daphne.sumAll"(%21) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> f64
  %23 = "daphne.ewAdd"(%arg0, %9) : (si64, si64) -> si64
  "scf.yield"(%23, %22, %20) : (si64, f64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()

The loop condition at this point in the pass evaluates to i < 3, because the WhileOp has 3 operands:

>>> p whileOp.getNumOperands()
$3 = 3

The afterBlock, however, only expects 2 arguments:

>>> p afterBlock.getNumArguments()
$4 = 2
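
To make the failure mode concrete, here is a minimal, MLIR-free sketch of a length-checked update. `Arg` and `setArgTypes` are hypothetical names used only for illustration; they are not part of DAPHNE or MLIR, and this is not a proposed fix for the pass. It only shows that a loop bounded by one list (3 entries) must not index into another list that holds 2.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a block argument; only its type string is modelled.
struct Arg {
    std::string type;
};

// Writes types[i] into args[i] for every index, but only if both lists have
// the same length. On a mismatch nothing is modified and false is returned,
// instead of indexing past the end of the shorter list.
bool setArgTypes(std::vector<Arg> &args, const std::vector<std::string> &types) {
    if (args.size() != types.size())
        return false;
    for (std::size_t i = 0; i < types.size(); ++i)
        args[i].type = types[i];
    return true;
}
```

With the numbers from the gdb session, a call with three types against the two after-block arguments is rejected, while a call with two types goes through.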

Looking at the documentation of scf::WhileOp, I think this is a bug on our side: we cannot simply assume that the after block has the same number of arguments as the WhileOp or the before block. The documentation says the following about the before region:

It forwards the trailing, non-condition operands of the scf.condition terminator either to the “after” region if the control flow is transferred there or to results of the scf.while operation otherwise. The “after” region takes as arguments the values produced by the “before” region and uses scf.yield to supply new arguments for the “before” region, into which it transfers the control flow unconditionally.
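
The arity rules in that passage can be written down without MLIR. The following is a rough sketch with a hypothetical `WhileModel` struct (not an MLIR or DAPHNE type) that holds only type lists. The values in `modelFromDump` are taken from the two block dumps above; the operand types of the WhileOp itself are not shown there, so they are assumed to equal the before-block argument types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using TypeList = std::vector<std::string>;

// Hypothetical stand-in for an scf.while op; only type lists are modelled.
struct WhileModel {
    TypeList operands;       // operands of scf.while (initial values)
    TypeList beforeArgs;     // arguments of the "before" block
    TypeList conditionArgs;  // trailing, non-condition operands of scf.condition
    TypeList afterArgs;      // arguments of the "after" block
    TypeList yieldOperands;  // operands of scf.yield
};

// Values entering "before" come from the op's operands or from scf.yield.
bool entryGroupMatches(const WhileModel &w) {
    return w.operands.size() == w.beforeArgs.size() &&
           w.yieldOperands.size() == w.beforeArgs.size();
}

// Values leaving "before" are forwarded by scf.condition to "after".
bool forwardGroupMatches(const WhileModel &w) {
    return w.conditionArgs.size() == w.afterArgs.size();
}

// The loop from the dumps: 3 values enter "before", 2 are forwarded.
WhileModel modelFromDump() {
    const std::string m = "!daphne.Matrix<100x1xf64:sp[1.000000e+00]>";
    WhileModel w;
    w.operands = {"si64", "f64", m};  // assumption: same types as beforeArgs
    w.beforeArgs = {"si64", "f64", m};
    w.conditionArgs = {"si64", m};
    w.afterArgs = {"si64", "f64"};    // types as printed in the afterBlock dump
    w.yieldOperands = {"si64", "f64", m};
    return w;
}
```

In this model both groups are consistent on their own, yet the two groups have different sizes (3 versus 2), which is exactly the case the pass does not account for.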

I debugged the current code at HEAD a bit, and it seems that returnsKnownProperties never returns true for an scf::WhileOp. The scf::WhileOp reports -1 for the sparsity of its result type, because getSparsity returns -1 for !daphne.Matrix<100x1xf64>.

Sadly, the only tests that set the --select-matrix-repr flag at all are test/api/cli/vectorized/MultiThreadedOpsTest.cpp and test/api/cli/algorithms/AlgorithmsTest.cpp, and the tests in test/api/cli/vectorized/MultiThreadedOpsTest.cpp don't contain an scf::WhileOp.

To sum up, the segfault you are experiencing comes from a bug in the existing code concerning the special treatment of scf ops in the SelectMatrixRepresentationsPass. I'm not familiar with this code, and it is pretty old and sadly not properly tested, so I cannot propose a solution at this time. But as I've now started debugging this, I'll try to write some tests and fix the problem over the weekend.

daphne-eu / daphne

[WIP] Dev issue #766 #847

Work in progress PR for #766.