Open Garic152 opened 1 month ago
Hi @Garic152 , thanks for the PR.
The segfault happens here in the SelectMatrixRepresentationsPass.cpp
.
Here's a section of the call stack at the time of the segfault. See 3 and 12 as the others are just LLVM internals.
─── Stack ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#0 0x00005555578eb99e in llvm::PointerIntPair<mlir::Type, 3u, mlir::detail::ValueImpl::Kind, llvm::PointerLikeTypeTraits<mlir::Type>, llvm::PointerIntPairInfo<mlir::Type, 3u, llvm::PointerLikeTypeTraits<mlir::Type> > >::setPointer(mlir::Type) & (this=0x28, PtrVal=...) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/PointerIntPair.h:65
#1 0x00005555578eb73d in mlir::detail::ValueImpl::setType (this=0x20, type=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:66
#2 0x00005555578eb76a in mlir::Value::setType (this=0x7fffffffabb0, newType=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:133
#3 0x0000555557ae4f63 in SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}::operator()(mlir::Operation) const (__closure=0x55555b0ebd98, op=0x55555b0c8a10) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:64
#4 0x0000555557ae6101 in std::_Function_handler<mlir::WalkResult (mlir::Operation*), SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}>::_M_invoke(std::_Any_data const&, mlir::Operation*&&) (__functor=..., __args#0=@0x7fffffffac60: 0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:285
#5 0x0000555557ae3a75 in std::function<mlir::WalkResult (mlir::Operation*)>::operator()(mlir::Operation*) const (this=0x55555b0ebd98, __args#0=0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:688
#6 0x0000555557ae3591 in llvm::function_ref<mlir::WalkResult (mlir::Operation*)>::callback_fn<std::function<mlir::WalkResult (mlir::Operation*)> >(long, mlir::Operation*) (callable=93825088273816, params#0=0x55555b0c8a10) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/STLFunctionalExtras.h:45
#7 0x000055555877338f in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#8 0x0000555558773337 in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#9 0x0000555557ae29e7 in mlir::detail::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::Operation*, mlir::WalkResult>(mlir::Operation*, std::function<mlir::WalkResult (mlir::Operation*)>&) (op=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Visitors.h:171
#10 0x0000555557ae2352 in mlir::Operation::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Operation.h:621
#11 0x0000555557ae1577 in mlir::OpState::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x7fffffffae08, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/OpDefinition.h:148
#12 0x0000555557ae5a0c in SelectMatrixRepresentationsPass::runOnOperation (this=0x55555b0ebc30) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:157
The problem seems to be that the condition of the WhileOp
has more arguments than the block of the body of the WhileOp
.
Here is the IR of the beforeBlock
:
>>> p beforeBlock.dump()
^bb0(%arg0: si64, %arg1: f64, %arg2: !daphne.Matrix<100x1xf64:sp[1.000000e+00]>):
%17 = "daphne.ewGt"(%arg1, %4) : (f64, f64) -> f64
%18 = "daphne.cast"(%17) : (f64) -> si64
%19 = "daphne.ewLe"(%arg0, %6) : (si64, si64) -> si64
%20 = "daphne.ewAnd"(%18, %19) : (si64, si64) -> si64
%21 = "daphne.cast"(%20) : (si64) -> i1
"scf.condition"(%21, %arg0, %arg2) : (i1, si64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()
Her is the IR of the afterBlock
:
>>> p afterBlock.dump()
^bb0(%arg0: si64, %arg1: f64):
%17 = "daphne.transpose"(%arg1) : (f64) -> !daphne.Matrix<1x100xf64:sp[1.000000e+00]>
%18 = "daphne.ewMul"(%14, %17) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]:rep[sparse]>, !daphne.Matrix<1x100xf64:sp[1.000000e+00]>) -> !daphne.Matrix<100x100xf64:sp[1.990000e-02]>
%19 = "daphne.maxRow"(%18) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]>) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
%20 = "daphne.ewMax"(%19, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
%21 = "daphne.ewNeq"(%20, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
%22 = "daphne.sumAll"(%21) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> f64
%23 = "daphne.ewAdd"(%arg0, %9) : (si64, si64) -> si64
"scf.yield"(%23, %22, %20) : (si64, f64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()
The loop condition here is evaluated to i < 3
:
>>> p whileOp.getNumOperands()
$3 = 3
While the afterBlock
only expects 2
arguments:
>>> p afterBlock.getNumArguments()
$4 = 2
Looking at the documentation of the scf::WhileOp, I think this is a bug on our side, we cannot simply assume that the after
block has the same number of arguments as the WhileOp
or the before
block.
It forwards the trailing, non-condition operands of the scf.condition terminator either to the “after” region if the control flow is transferred there or to results of the scf.while operation otherwise. The “after” region takes as arguments the values produced by the “before” region and uses scf.yield to supply new arguments for the “before” region, into which it transfers the control flow unconditionally.
I debugged the current code at HEAD
a bit and it seems like the returnsKnownProperties
never returns true
with a scf::WhileOp
here.
The scf::WhileOp
returns -1
for it's results type sparsity here as getSparsity
for !daphne.Matrix<100x1xf64>
returns -1
.
We sadly don't have any tests besides test/api/cli/vectorized/MultiThreadedOpsTest.cpp
and test/api/cli/algorithms/AlgorithmsTest.cpp
that set the --select-matrix-repr
flag at all, and the tests in test/api/cli/vectorized/MultiThreadedOpsTest.cpp
don't have a scf::WhileOp
.
To sum up, the segfault you are experiencing comes from a bug in the existing code concerning the special treatment of scf
-ops in the SelectMatrixRepresentationsPass
. As I'm not familiar with this code and given that it's pretty old and sadly not properly tested, I cannot propose a solution to you at this time, but as I've not started debugging this I'll try to write some tests and fix this problem over the weekend.
Work in progress PR for #766.
After some more debugging, I am unfortunately still facing segmentation fault issues due to the
test/api/cli/algorithms/components.daphne
test when using the following command:After adding the RowAggMaxOp, the test does work and gives the same
--explain property_inference
output as the main branch.When adding the
SparsityRemainsIfAllInputOneOrZero
trait in theDaphneInferSparsityOpInterface.cpp
file to both theewMin
andewMax
operation, the property inference pass does work, but the select_matrix_repr pass fails with a segmentation fault issue, which I included in this gist file.I compared the property inference pass output from both the main branch (right side) and #766 (left side) and found only one difference (the 2 outputs are included in the gist file as well):
Before going into the while loop, there is another cast operation in the main branch, which is not in included after adding the sparsitytrait for ewMin and ewMax. I unfortunately cannot come up with a reason why this is the case, but strongly think this might be the reason for the segmentation fault.
After further examining the while loop with all the previously mentioned sparsity traits added, I noticed that when not using a while loop (which i changed to loop 2 times) to calculate the result for
c
, but rather repeating the contents twice manuallythe code executes without any problems, which further adds to the confusion of why not using a while loop, which should not change the codes logic, changes the outcome in terms of the passes.
I will continue to examine the behaviour of the code over the next few days and perhaps add the sparsity estimators one by one to see where errors might have occurred, but especially after the strange while loop behaviour I am starting to run out of ideas on how to properly fix this segfault problem.
Edit: After having executed the above mentioned cli test, I have actually come across another segfault error type once now, which I included at the end of the gist file. I am not sure if this is related to the changes in #766 or if it's maybe just hardware related, but wanted to mention it nonetheless.