daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
67 stars 62 forks source link

[WIP] Dev issue #766 #847

Open Garic152 opened 1 month ago

Garic152 commented 1 month ago

Work in progress PR for #766.

After some more debugging, I am unfortunately still facing segmentation fault issues due to the test/api/cli/algorithms/components.daphne test when using the following command:

./bin/daphne --select-matrix-representations --args n=100,e=100 --explain property_inference test/api/cli/algorithms/components.daphne

After adding the RowAggMaxOp, the test does work and gives the same --explain property_inference output as the main branch.

When adding the SparsityRemainsIfAllInputOneOrZero trait in the DaphneInferSparsityOpInterface.cpp file to both the ewMin and ewMax operation, the property inference pass does work, but the select_matrix_repr pass fails with a segmentation fault issue, which I included in this gist file.

I compared the property inference pass output from both the main branch (right side) and #766 (left side) and found only one difference (the 2 outputs are included in the gist file as well):

image

Before going into the while loop, there is another cast operation in the main branch, which is not in included after adding the sparsitytrait for ewMin and ewMax. I unfortunately cannot come up with a reason why this is the case, but strongly think this might be the reason for the segmentation fault.

After further examining the while loop with all the previously mentioned sparsity traits added, I noticed that when not using a while loop (which i changed to loop 2 times) to calculate the result for c, but rather repeating the contents twice manually

  ...
  // Iterative computation of connected components (decisive part).
  u = max(aggMax(G * t(c), 0), c);
  diff = sum(u != c);
  c = u;
  iter = iter + 1;

  u = max(aggMax(G * t(c), 0), c);
  diff = sum(u != c);
  c = u;
  iter = iter + 1;
  ...

the code executes without any problems, which further adds to the confusion of why not using a while loop, which should not change the codes logic, changes the outcome in terms of the passes.

I will continue to examine the behaviour of the code over the next few days and perhaps add the sparsity estimators one by one to see where errors might have occurred, but especially after the strange while loop behaviour I am starting to run out of ideas on how to properly fix this segfault problem.

Edit: After having executed the above mentioned cli test, I have actually come across another segfault error type once now, which I included at the end of the gist file. I am not sure if this is related to the changes in #766 or if it's maybe just hardware related, but wanted to mention it nonetheless.

philipportner commented 1 month ago

Hi @Garic152 , thanks for the PR.

The segfault happens here in the SelectMatrixRepresentationsPass.cpp. Here's a section of the call stack at the time of the segfault. See 3 and 12 as the others are just LLVM internals.

─── Stack ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#0  0x00005555578eb99e in llvm::PointerIntPair<mlir::Type, 3u, mlir::detail::ValueImpl::Kind, llvm::PointerLikeTypeTraits<mlir::Type>, llvm::PointerIntPairInfo<mlir::Type, 3u, llvm::PointerLikeTypeTraits<mlir::Type> > >::setPointer(mlir::Type) & (this=0x28, PtrVal=...) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/PointerIntPair.h:65
#1  0x00005555578eb73d in mlir::detail::ValueImpl::setType (this=0x20, type=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:66
#2  0x00005555578eb76a in mlir::Value::setType (this=0x7fffffffabb0, newType=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Value.h:133
#3  0x0000555557ae4f63 in SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}::operator()(mlir::Operation) const (__closure=0x55555b0ebd98, op=0x55555b0c8a10) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:64
#4  0x0000555557ae6101 in std::_Function_handler<mlir::WalkResult (mlir::Operation*), SelectMatrixRepresentationsPass::walkOp::{lambda(mlir::Operation*)#1}>::_M_invoke(std::_Any_data const&, mlir::Operation*&&) (__functor=..., __args#0=@0x7fffffffac60: 0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:285
#5  0x0000555557ae3a75 in std::function<mlir::WalkResult (mlir::Operation*)>::operator()(mlir::Operation*) const (this=0x55555b0ebd98, __args#0=0x55555b0c8a10) at /usr/include/c++/9/bits/std_function.h:688
#6  0x0000555557ae3591 in llvm::function_ref<mlir::WalkResult (mlir::Operation*)>::callback_fn<std::function<mlir::WalkResult (mlir::Operation*)> >(long, mlir::Operation*) (callable=93825088273816, params#0=0x55555b0c8a10) at /home/philipportner/daphne/thirdparty/installed/include/llvm/ADT/STLFunctionalExtras.h:45
#7  0x000055555877338f in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#8  0x0000555558773337 in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#9  0x0000555557ae29e7 in mlir::detail::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::Operation*, mlir::WalkResult>(mlir::Operation*, std::function<mlir::WalkResult (mlir::Operation*)>&) (op=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Visitors.h:171
#10 0x0000555557ae2352 in mlir::Operation::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x55555b0adfd0, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/Operation.h:621
#11 0x0000555557ae1577 in mlir::OpState::walk<(mlir::WalkOrder)0, std::function<mlir::WalkResult (mlir::Operation*)>&, mlir::WalkResult>(std::function<mlir::WalkResult (mlir::Operation*)>&) (this=0x7fffffffae08, callback=...) at /home/philipportner/daphne/thirdparty/installed/include/mlir/IR/OpDefinition.h:148
#12 0x0000555557ae5a0c in SelectMatrixRepresentationsPass::runOnOperation (this=0x55555b0ebc30) at /home/philipportner/daphne/src/compiler/inference/SelectMatrixRepresentationsPass.cpp:157

The problem seems to be that the condition of the WhileOp has more arguments than the block of the body of the WhileOp.

Here is the IR of the beforeBlock:

>>> p beforeBlock.dump()
^bb0(%arg0: si64, %arg1: f64, %arg2: !daphne.Matrix<100x1xf64:sp[1.000000e+00]>):
  %17 = "daphne.ewGt"(%arg1, %4) : (f64, f64) -> f64
  %18 = "daphne.cast"(%17) : (f64) -> si64
  %19 = "daphne.ewLe"(%arg0, %6) : (si64, si64) -> si64
  %20 = "daphne.ewAnd"(%18, %19) : (si64, si64) -> si64
  %21 = "daphne.cast"(%20) : (si64) -> i1
  "scf.condition"(%21, %arg0, %arg2) : (i1, si64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()

Her is the IR of the afterBlock:

>>> p afterBlock.dump()
^bb0(%arg0: si64, %arg1: f64):
  %17 = "daphne.transpose"(%arg1) : (f64) -> !daphne.Matrix<1x100xf64:sp[1.000000e+00]>
  %18 = "daphne.ewMul"(%14, %17) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]:rep[sparse]>, !daphne.Matrix<1x100xf64:sp[1.000000e+00]>) -> !daphne.Matrix<100x100xf64:sp[1.990000e-02]>
  %19 = "daphne.maxRow"(%18) : (!daphne.Matrix<100x100xf64:sp[1.990000e-02]>) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %20 = "daphne.ewMax"(%19, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %21 = "daphne.ewNeq"(%20, %arg1) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>, f64) -> !daphne.Matrix<100x1xf64:sp[1.000000e+00]>
  %22 = "daphne.sumAll"(%21) : (!daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> f64
  %23 = "daphne.ewAdd"(%arg0, %9) : (si64, si64) -> si64
  "scf.yield"(%23, %22, %20) : (si64, f64, !daphne.Matrix<100x1xf64:sp[1.000000e+00]>) -> ()

The loop condition here is evaluated to i < 3:

>>> p whileOp.getNumOperands()
$3 = 3

While the afterBlock only expects 2 arguments:

>>> p afterBlock.getNumArguments()
$4 = 2

Looking at the documentation of the scf::WhileOp, I think this is a bug on our side, we cannot simply assume that the after block has the same number of arguments as the WhileOp or the before block.

It forwards the trailing, non-condition operands of the scf.condition terminator either to the “after” region if the control flow is transferred there or to results of the scf.while operation otherwise. The “after” region takes as arguments the values produced by the “before” region and uses scf.yield to supply new arguments for the “before” region, into which it transfers the control flow unconditionally.

I debugged the current code at HEAD a bit and it seems like the returnsKnownProperties never returns true with a scf::WhileOp here. The scf::WhileOp returns -1 for it's results type sparsity here as getSparsity for !daphne.Matrix<100x1xf64> returns -1.

We sadly don't have any tests besides test/api/cli/vectorized/MultiThreadedOpsTest.cpp and test/api/cli/algorithms/AlgorithmsTest.cpp that set the --select-matrix-repr flag at all, and the tests in test/api/cli/vectorized/MultiThreadedOpsTest.cpp don't have a scf::WhileOp.

To sum up, the segfault you are experiencing comes from a bug in the existing code concerning the special treatment of scf-ops in the SelectMatrixRepresentationsPass. As I'm not familiar with this code and given that it's pretty old and sadly not properly tested, I cannot propose a solution to you at this time, but as I've not started debugging this I'll try to write some tests and fix this problem over the weekend.