Open Garic152 opened 4 weeks ago
Hi @Garic152, thanks for working on this.
Our code generation pipeline is currently not run by default; it is only executed when you add --mlir-codegen when executing daphne. As the code generation pipeline does not support all workloads at the moment, it should be sufficient to run daphne without the --mlir-codegen flag.
When you run --explain, the last explain output we provide is llvm, which is at the end of our pipeline. If that one printed something, then the lowering pipeline has completed.
Looking at the stack trace you provided, if you pipe it through c++filt, you get demangled names.
cat stacktrace.txt | c++filt
./bin/daphne(+0x148b8e2)[0x5fa05da8d8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x79716d045320]
/daphne/bin/../lib/libAllKernels.so(DenseMatrix<double>::getValuesInternal(IAllocationDescriptor const*, Range const*)+0x412)[0x797161dcf8d2]
/daphne/bin/../lib/libAllKernels.so(_transpose__DenseMatrix_double__DenseMatrix_double+0x8e)[0x797161c9405e]
[0x79716d699477]
[0x79716d69d109]
[0x79716d69d16d]
[0x79716d69d2dd]
./bin/daphne(+0x1e04e15)[0x5fa05e406e15]
./bin/daphne(+0x14ad1b3)[0x5fa05daaf1b3]
./bin/daphne(+0x14b262d)[0x5fa05dab462d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x79716d02a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x79716d02a28b]
./bin/daphne(+0x148a3a5)[0x5fa05da8c3a5]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 11
Seems like the problem is triggered by the _transpose__DenseMatrix_double__DenseMatrix_double kernel at one of the calls to DenseMatrix::getValues.
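To make it easier to see which frames of such a trace carry a symbol name at all, the lines can be filtered mechanically. The following is only a minimal sketch in Python: the helper name named_frames and the regular expression are illustrative assumptions and not part of DAPHNE, and it only understands the module(symbol+0xoffset)[0xaddress] layout seen in the trace above.

```python
import re

# One backtrace line: module(symbol+0xoffset)[0xaddress]
# The symbol part may be empty, e.g. "./bin/daphne(+0x148b8e2)[0x...]".
FRAME_RE = re.compile(r"^(?P<module>[^(\[]*)\((?P<inner>.*)\)\[0x(?P<addr>[0-9a-f]+)\]$")


def named_frames(trace: str) -> list[tuple[str, str]]:
    """Return (module file name, symbol) for every frame that has a symbol."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # JIT frames such as "[0x...]" and ordinary log lines
        # Split off the trailing "+0x<offset>"; demangled names may contain
        # parentheses, so only the last "+0x" is treated as the separator.
        symbol, _, _offset = match.group("inner").rpartition("+0x")
        if symbol:
            module = match.group("module").rsplit("/", 1)[-1]
            frames.append((module, symbol))
    return frames


SAMPLE = """\
./bin/daphne(+0x148b8e2)[0x5fa05da8d8e2]
/daphne/bin/../lib/libAllKernels.so(DenseMatrix<double>::getValuesInternal(IAllocationDescriptor const*, Range const*)+0x412)[0x797161dcf8d2]
/daphne/bin/../lib/libAllKernels.so(_transpose__DenseMatrix_double__DenseMatrix_double+0x8e)[0x797161c9405e]
[0x79716d699477]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x79716d02a28b]
"""

for module, symbol in named_frames(SAMPLE):
    print(f"{module}: {symbol}")
```

Frames without a symbol (the stripped daphne binary and the JIT-compiled code) are skipped, so the kernel and data structure calls stand out.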
I'd suggest compiling with --debug and starting daphne with a debugger to figure out what's going on.
gdb --args ./bin/daphne tools/dml2daph/translated_files/test_kmeans.daph
Otherwise, it would be good if you can create a minimal reproducible example that triggers this problem and create an issue :)
My bad, I missed that you already included a reproducible example.
Here's the backtrace:
(gdb) bt
#0 std::__uniq_ptr_impl<Range, std::default_delete<Range> >::_M_ptr (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:154
#1 std::unique_ptr<Range, std::default_delete<Range> >::get (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:361
#2 std::unique_ptr<Range, std::default_delete<Range> >::operator bool (this=0x10) at /usr/include/c++/9/bits/unique_ptr.h:375
#3 std::operator==<Range, std::default_delete<Range> >(std::unique_ptr<Range, std::default_delete<Range> > const&, decltype(nullptr)) (__x=std::unique_ptr<Range> = {...}) at /usr/include/c++/9/bits/unique_ptr.h:722
#4 DenseMatrix<double>::getValuesInternal (this=0x555559cef0b0, alloc_desc=<optimized out>, range=<optimized out>) at /home/philipportner/daphne/src/runtime/local/datastructures/DenseMatrix.cpp:194
#5 0x00007fffee2730a6 in DenseMatrix<double>::getValues (range=0x0, alloc_desc=0x0, this=<optimized out>) at /home/philipportner/daphne/src/runtime/local/datastructures/DenseMatrix.h:221
#6 Transpose<DenseMatrix<double>, DenseMatrix<double> >::apply (ctx=0x555559c15c60, arg=<optimized out>, res=@0x7fffffffb1d0: 0x555559dac1b0) at /home/philipportner/daphne/src/runtime/local/kernels/Transpose.h:63
#7 transpose<DenseMatrix<double>, DenseMatrix<double> > (ctx=0x555559c15c60, arg=<optimized out>, res=@0x7fffffffb1d0: 0x555559dac1b0) at /home/philipportner/daphne/src/runtime/local/kernels/Transpose.h:40
#8 _transpose__DenseMatrix_double__DenseMatrix_double (res=0x7fffffffb1d0, arg=0x555559cef0b0, kId=245, ctx=0x555559c15c60) at /home/philipportner/daphne/build/src/runtime/local/kernels/kernels_62.cpp:15
#9 0x00007ffff7fc4477 in m_kmeans-2-1 ()
#10 0x00007ffff7fc8109 in main ()
#11 0x00007ffff7fc816d in _mlir_ciface_main ()
#12 0x00007ffff7fc82dd in _mlir__mlir_ciface_main ()
#13 0x00005555575544ab in mlir::ExecutionEngine::invokePacked(llvm::StringRef, llvm::MutableArrayRef<void*>) ()
#14 0x0000555556cdb041 in mlir::ExecutionEngine::invoke<>(llvm::StringRef) (funcName=..., this=0x555559d325f0) at /usr/include/c++/9/bits/basic_string.h:940
#15 startDAPHNE (argc=2, argv=0x7fffffffd878, daphneLibRes=0x0, id=<optimized out>, user_config=...) at /home/philipportner/daphne/src/api/internal/daphne_internal.cpp:613
#16 0x0000555556ce0806 in mainInternal (argc=2, argv=0x7fffffffd878, daphneLibRes=0x0) at /home/philipportner/daphne/src/api/internal/daphne_internal.cpp:668
#17 0x00007ffff75cb083 in __libc_start_main (main=0x555556cae2f0 <main(int, char const**)>, argc=2, argv=0x7fffffffd878, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd868) at ../csu/libc-start.c:308
#18 0x0000555556cae22e in _start ()
After some more debugging, I came across another very weird error, this time in line 123. This error occurs during the execution of an element-wise multiplication operation within the EWMin function.
Here the code crashes with the following error message:
eric@daphne-container:/daphne$ ./bin/daphne tools/dml2daph/translated_files/test_kmeans.daph
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/daphne/bin/../lib/libAllKernels.so(_ZNK14MetaDataObject9getLatestEv+0xf)[0x7d89e75e61cf]
/daphne/bin/../lib/libAllKernels.so(_ZN11DenseMatrixIdE17getValuesInternalEPK21IAllocationDescriptorPK5Range+0x3c0)[0x7d89e75cf770]
/daphne/bin/../lib/libAllKernels.so(+0x77651e)[0x7d89e737651e]
/daphne/bin/../lib/libAllKernels.so(_ewMin__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double+0x4e)[0x7d89e7377f7e]
[0x7d89f3759e2d]
[0x7d89f375ad94]
[0x7d89f375addd]
[0x7d89f375af3d]
./bin/daphne(+0x1e04e15)[0x5e54cb933e15]
./bin/daphne(+0x14ad1b3)[0x5e54cafdc1b3]
./bin/daphne(+0x14b262d)[0x5e54cafe162d]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7d89f2e2a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7d89f2e2a28b]
./bin/daphne(+0x148a3a5)[0x5e54cafb93a5]
[error]: Got an abort signal from the execution engine. Most likely an exception in a shared library. Check logs!
Execution error: Returning from signal 11
corrupted double-linked list
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d89f2e9eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d89f2e4526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d89f2e288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7d89f2e297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0xa8fe5)[0x7d89f2ea8fe5]
/lib/x86_64-linux-gnu/libc.so.6(+0xa9b6c)[0x7d89f2ea9b6c]
/lib/x86_64-linux-gnu/libc.so.6(+0xa9d1b)[0x7d89f2ea9d1b]
/lib/x86_64-linux-gnu/libc.so.6(+0xaad95)[0x7d89f2eaad95]
/lib/x86_64-linux-gnu/libc.so.6(+0xab42a)[0x7d89f2eab42a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x7e)[0x7d89f2eadd9e]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43atn12ATNConfigSetD2Ev+0x91)[0x7d89f36b5351]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43atn12ATNConfigSetD0Ev+0xd)[0x7d89f36b543d]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43dfa8DFAStateD0Ev+0xd)[0x7d89f3704d0d]
/usr/local/lib/libantlr4-runtime.so.4.9.2(_ZN6antlr43dfa3DFAD1Ev+0x53)[0x7d89f37015c3]
./bin/daphne(+0x152512c)[0x5e54cb05412c]
/lib/x86_64-linux-gnu/libc.so.6(+0x47a66)[0x7d89f2e47a66]
/lib/x86_64-linux-gnu/libc.so.6(+0x47bae)[0x7d89f2e47bae]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1d1)[0x7d89f2e2a1d1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7d89f2e2a28b]
./bin/daphne(+0x148a3a5)[0x5e54cafb93a5]
*** longjmp causes uninitialized stack frame ***: terminated
./bin/daphne(+0x148b8e2)[0x5e54cafba8e2]
/lib/x86_64-linux-gnu/libc.so.6(+0x45320)[0x7d89f2e45320]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7d89f2e9eb1c]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7d89f2e4526e]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7d89f2e288ff]
/lib/x86_64-linux-gnu/libc.so.6(+0x297b6)[0x7d89f2e297b6]
/lib/x86_64-linux-gnu/libc.so.6(+0x136c19)[0x7d89f2f36c19]
/lib/x86_64-linux-gnu/libc.so.6(+0x135c21)[0x7d89f2f35c21]
/lib/x86_64-linux-gnu/libc.so.6(__longjmp_chk+0x32)[0x7d89f2f37302]
./bin/daphne(+0x148b909)[0x5e54cafba909]
As the matrices themselves work totally fine when printed, and the EWMin operation on its own also works great, I did another analysis with gdb, as @philipportner recommended.
This led to the following backtrace:
Thread 1 "daphne" received signal SIGSEGV, Segmentation fault.
0x00007c21b1eabe87 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007c21b1eabe87 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007c21b1ead6e4 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007c21b22bb904 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007c21a61bd658 in std::__new_allocator<unsigned long>::allocate (this=0x7fff97eeb430, __n=1) at /usr/include/c++/13/bits/new_allocator.h:151
#4 0x00007c21a61bd2cc in std::allocator<unsigned long>::allocate (__n=1, this=0x7fff97eeb430) at /usr/include/c++/13/bits/allocator.h:198
#5 std::allocator_traits<std::allocator<unsigned long> >::allocate (__n=1, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:482
#6 std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_allocate (this=0x7fff97eeb430, __n=1) at /usr/include/c++/13/bits/stl_vector.h:381
#7 0x00007c21a64bb3b3 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_create_storage (this=0x7fff97eeb430, __n=1)
at /usr/include/c++/13/bits/stl_vector.h:398
#8 0x00007c21a64ba849 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_Vector_base (this=0x7fff97eeb430, __n=1, __a=...)
at /usr/include/c++/13/bits/stl_vector.h:335
#9 0x00007c21a64fd005 in std::vector<unsigned long, std::allocator<unsigned long> >::vector (this=0x7fff97eeb430,
__x=std::vector of length 1, capacity 1 = {...}) at /usr/include/c++/13/bits/stl_vector.h:603
#10 0x00007c21a64fc3ca in MetaDataObject::getLatest (this=0x5d50a500d600) at /daphne/src/runtime/local/datastructures/MetaDataObject.cpp:91
#11 0x00007c21a64cc573 in DenseMatrix<double>::getValuesInternal (this=0x5d50a5133770, alloc_desc=0x0, range=0x0)
at /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:188
#12 0x00007c21a6088ad9 in DenseMatrix<double>::getValues (this=0x5d50a5133770, alloc_desc=0x0, range=0x0)
at /daphne/src/runtime/local/datastructures/DenseMatrix.h:222
#13 0x00007c21a6176ce1 in EwBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> >::apply (opCode=BinaryOpCode::MUL,
res=@0x7fff97eeb940: 0x5d50a50de580, lhs=0x5d50a5133770, rhs=0x5d50a56ec970, ctx=0x5d50a58ef3a0) at /daphne/src/runtime/local/kernels/EwBinaryMat.h:66
#14 0x00007c21a61760d2 in ewBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> > (opCode=BinaryOpCode::MUL,
res=@0x7fff97eeb940: 0x5d50a50de580, lhs=0x5d50a5133770, rhs=0x5d50a56ec970, ctx=0x5d50a58ef3a0) at /daphne/src/runtime/local/kernels/EwBinaryMat.h:43
#15 0x00007c21a616086e in _ewMul__DenseMatrix_double__DenseMatrix_double__DenseMatrix_double (res=0x7fff97eeb940, lhs=0x5d50a5133770, rhs=0x5d50a56ec970,
kId=82, ctx=0x5d50a58ef3a0) at /daphne/build/src/runtime/local/kernels/kernels_21.cpp:186
#16 0x00007c21b3c055c6 in m_kmeans-2-1 ()
#17 0x00007c21b3c06d94 in main ()
#18 0x00007c21b3c06ddd in _mlir_ciface_main ()
#19 0x00007c21b3c06f3d in _mlir__mlir_ciface_main ()
#20 0x00005d50747eab45 in mlir::ExecutionEngine::invokePacked(llvm::StringRef, llvm::MutableArrayRef<void*>) ()
#21 0x00005d50738b3d93 in mlir::ExecutionEngine::invoke<>(llvm::StringRef) (this=0x5d50a5129d00, funcName=...)
at /usr/local/include/mlir/ExecutionEngine/ExecutionEngine.h:180
#22 0x00005d5073891b00 in startDAPHNE (argc=2, argv=0x7fff97eede98, daphneLibRes=0x0, id=0x7fff97eedb68, user_config=...)
at /daphne/src/api/internal/daphne_internal.cpp:613
#23 0x00005d50738935e8 in mainInternal (argc=2, argv=0x7fff97eede98, daphneLibRes=0x0) at /daphne/src/api/internal/daphne_internal.cpp:668
#24 0x00005d507388c552 in main (argc=2, argv=0x7fff97eede98) at /daphne/src/api/cli/daphne.cpp:19
There were several things I noticed when analyzing the backtrace:
In frame 13, inside EwBinaryMat<DenseMatrix<double>, DenseMatrix<double>, DenseMatrix<double> >, valuesRhs is a valid address, whereas valuesLhs is a null pointer, which should not be the case.
In frame 12 (DenseMatrix<double>::getValues), all local variables (isLatest, id, ptr) report memory access errors:
(gdb) info locals
isLatest = <error reading variable: Cannot access memory at address 0x5d00a5073810>
id = <error reading variable: Cannot access memory at address 0x5>
ptr = <error reading variable: Cannot access memory at address 0x1>
This could indicate that the error originates in frame 11 (getValuesInternal()) or even earlier.
This is the *this I get from frame 11:
#11 0x00007c21a64cc573 in DenseMatrix<double>::getValuesInternal (this=0x5d50a5133770, alloc_desc=0x0, range=0x0) at /daphne/src/runtime/local/datastructures/DenseMatrix.cpp:188
188 auto latest = this->mdo->getLatest();
(gdb) print this
$23 = (DenseMatrix<double> * const) 0x5d50a5133770
(gdb) print *this
$24 = {<Matrix<double>> = {<Structure> = {_vptr.Structure = 0x7c21a739b8f0, refCounter = 1, refCounterMutex = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}, row_offset = 0, col_offset = 0, numRows = 1, numCols = 5, mdo = std::shared_ptr<MetaDataObject> (use count 1, weak count 0) = {get() = 0x5d50a500d600}}, <No data fields>}, is_view = false, rowSkip = 5,
values = std::shared_ptr<double []> (use count 1, weak count 0) = {get() = 0x5d50a52fe9e0}, bufferSize = 40, lastAppendedRowIdx = 0, lastAppendedColIdx = 0}
Unfortunately, this is the point where I am no longer sure whether the Structure behaves normally, because so far I haven't come into contact with the data structure generation part of DAPHNE at all.
4 Side notes that may help to better identify the source of the error:
I will push another reproducible error file, kmeans_second_error.daph, in a second.
Also, the code does work when, instead of using the ifelse construct at the end, you use only one of the two assignment statements that the ifelse would choose between.
I tried running it locally on PR #758, as this PR refactors code concerning the getValues functions. It doesn't fix the problem, and as you already pointed out, if lhs is a nullptr, we messed up somewhere before.
The ternary at kmeans_final.daph:123 doesn't look right to me. Looks like you are trying to assign an f64 to a DenseMatrix in one of the branches.
When adding this diff, the type is still DenseMatrix(375, 1), but trying to print(min_distances) itself fails.
min_distances = i == 1 ? as.matrix<f64>(is_row_in_samples * distances) : as.f64(min(min_distances, distances));
+ print(typeOf(min_distances));
Translating the script turned out to be more of a challenge than expected.
After going through the code and comparing the variables of the auto-translated code line by line with the kmeans.dml output, I found many extremely sneaky errors. Most were due to 0- vs. 1-based indexing translation errors, especially in the ctable function, and to the reshape function, which does not reshape column by column (as in SystemDS) but row by row (maybe an option to choose between these two methods would be useful in DAPHNE?).
After fixing these errors and making sure every operation does what it is supposed to when compared to SystemDS, I once again arrived at the segfault I initially mentioned, in the k-means iteration step in line 153.
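As an illustration of the indexing pitfall, here is a minimal Python sketch of the range conversion. It assumes that the DML source uses 1-based ranges with an inclusive upper bound and that the target uses 0-based ranges with an exclusive upper bound (as Python slices do); the target convention should be checked against the DaphneDSL language reference. The helper name dml_range_to_zero_based is made up for this example.

```python
def dml_range_to_zero_based(first: int, last: int) -> tuple[int, int]:
    """Convert a 1-based inclusive range to a 0-based half-open range."""
    if first < 1 or last < first:
        raise ValueError(f"invalid 1-based range {first}:{last}")
    # Shift the lower bound by one; the upper bound stays the same because
    # going from inclusive to exclusive cancels the shift.
    return first - 1, last


rows = ["r1", "r2", "r3", "r4"]

# Rows 2 to 3 in 1-based, inclusive notation ...
start, stop = dml_range_to_zero_based(2, 3)

# ... are rows[1:3] in 0-based, half-open notation.
selected = rows[start:stop]
print(start, stop, selected)
```

Only the lower bound changes, which is why such translation errors are easy to overlook: shifting both bounds, or neither, still yields a range of plausible size.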
I tried reproducing this error in another file:
C = rand(5, 5, 0.0, 1.0, 1.0, 1234);
a = 10.0;
counter = 1.0;
while (counter < 5) {
    print(C);
    C_new = C + C;
    b = a;
    a = 1.0;
    if (a < b) {
        print("false statement");
    } else {
        counter = counter + 1;
        C = C_new;
    }
}
I replaced all variables and calculations with much simpler terms and also removed parts of the original code that had nothing to do with the error, to make the code easier to debug.
When running the code, the while loop runs once, and after assigning C_new to C like in the kmeans algorithm, the contents of C are somehow corrupted.
I took a look at the IR but didn't notice anything particularly wrong. The code also runs fine in numpy, so there shouldn't be any logic errors. I also checked all the types, but there weren't any visible mistakes there either.
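For reference, the intended control flow of the script can be traced with a plain-Python sketch. This is not the numpy version mentioned above but an assumed equivalent using nested lists; the function name run_repro is illustrative, and Python's random generator will not produce the same values as DAPHNE's rand with seed 1234.

```python
import random


def run_repro(seed: int = 1234, size: int = 5):
    """Mirror the loop of the reproducer and record which branch is taken."""
    rng = random.Random(seed)
    C = [[rng.random() for _ in range(size)] for _ in range(size)]
    initial = [row[:] for row in C]
    a = 10.0
    counter = 1.0
    branches = []
    while counter < 5:
        C_new = [[x + x for x in row] for row in C]  # C_new = C + C
        b = a
        a = 1.0
        if a < b:
            branches.append("if")    # only taken in the first iteration
        else:
            branches.append("else")
            counter = counter + 1
            C = C_new
    return initial, C, branches


initial, final, branches = run_repro()
print(branches)
```

Under these assumptions the if branch is taken once and the else branch four times, so the final C should equal the initial matrix multiplied by 16. Any other content after the first assignment C = C_new points at the runtime rather than at the script's logic.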
I would really appreciate some advice here, as this bug is unfortunately something I just cannot figure out.
Edit: This problem seems similar to the one in #558, where nested if statements also caused problems. (I came across this while looking at the multiLogReg.daph file I was also supposed to be working on).
Thanks for putting all this effort into translating the kmeans script, @Garic152! I know from my own experience that it can be a very cumbersome process. That's why it's good to identify all those points that the dml2daph tool doesn't handle correctly yet; with that we can make the translator better over time.
reshape() works differently in DAPHNE and SystemDS. Can you give an example?
I had a look at the example in your latest comment and simplified it further to isolate the error. It turned out the problem is a double-free due to a bug in object reference management in combination with the arith.select op. It can be seen with --explain obj_ref_mgnt. For more details, see issue #911. I've prepared a fix in PR #912, but it's still in draft state, since it might create memory leaks (I will further investigate this). However, the change is essentially one line in src/compiler/lowering/ManageObjRefsPass.cpp, so you could apply that to your clone and try if your scripts work then.
In very rare cases, such memory corruptions can happen through bugs in DAPHNE's reference counter management. It can be helpful to try running DAPHNE with --no-obj-ref-mgnt, which switches off garbage collection, i.e., no data object will be freed. If a crashing DaphneDSL script succeeds with that flag, then the problem is likely related to DAPHNE's garbage collection.
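The following toy model in Python illustrates how a reference counting bug can turn into a double-free, and why switching off reference management hides it. It is purely hypothetical: the class and function names are invented, and the assumed cause (a missing increment when a select-like op returns one of its inputs) is only one way such a bug could arise, not a description of DAPHNE's actual implementation; see issue #911 for the real analysis.

```python
class DataObject:
    """Toy data object with a manual reference counter."""

    def __init__(self, name: str):
        self.name = name
        self.ref_count = 1
        self.freed = False


def inc_ref(obj: DataObject) -> None:
    obj.ref_count += 1


def dec_ref(obj: DataObject, manage_refs: bool = True) -> None:
    if not manage_refs:
        return  # reference management switched off: nothing is ever freed
    if obj.freed:
        raise RuntimeError(f"double free of {obj.name}")
    obj.ref_count -= 1
    if obj.ref_count == 0:
        obj.freed = True


def select_then_release(forget_inc_ref: bool, manage_refs: bool = True) -> DataObject:
    c_new = DataObject("C_new")
    c = c_new  # a select-like op hands back one of its inputs as its result
    if not forget_inc_ref:
        inc_ref(c_new)  # the alias must be counted as a second reference
    dec_ref(c_new, manage_refs)  # the variable C_new is released
    dec_ref(c, manage_refs)      # the variable C is released
    return c_new


print(select_then_release(forget_inc_ref=False).freed)
```

With the increment in place the object is freed exactly once. Without it the second release hits an already freed object, and with manage_refs=False neither release does anything, which matches the observation that a script succeeding without reference management points at this area.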
The ternary at kmeans_final.daph:123 doesn't look right to me. Looks like you are trying to assign a f64 to a DenseMatrix in one of the branches.
@philipportner The conditional op looks good to me. The as.f64(...) only sets the value type of the result, while retaining the data type (see the DaphneDSL language reference). I.e., if the input to the cast is a matrix, then the result is a matrix of f64.
Thanks for the support, the changes in #912 make the code work! The kmeans algorithm now runs to the end and works for some simple test cases I created. I will now add the comments from SystemDS and create some more test cases to see if the algorithm works correctly in all cases.
Regarding the DAPHNE reshape() vs. the SystemDS matrix(): the difference was actually due to the byrow=bool argument (which I just noticed), which was set to false in the kmeans algorithm, so I had to transpose the matrix before applying the reshape function in the DAPHNE translation.
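The relation between the two reshape orders can be sketched in plain Python with nested lists. The function names are made up for this illustration, and it assumes that a column-wise reshape both reads the input and fills the output column by column, in which case it can be expressed through a row-wise reshape wrapped in transposes.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def reshape_by_row(m, rows: int, cols: int):
    """Read m row by row and fill the result row by row."""
    flat = [x for row in m for x in row]
    if len(flat) != rows * cols:
        raise ValueError("number of elements must stay the same")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def reshape_by_col(m, rows: int, cols: int):
    """Read m column by column and fill the result column by column.

    Expressed via the row-wise variant: transpose the input, reshape to the
    swapped shape, and transpose the result back.
    """
    return transpose(reshape_by_row(transpose(m), cols, rows))


X = [[1, 2, 3],
     [4, 5, 6]]

print(reshape_by_row(X, 3, 2))  # [[1, 2], [3, 4], [5, 6]]
print(reshape_by_col(X, 3, 2))  # [[1, 5], [4, 3], [2, 6]]
```

Depending on the shapes involved, the final transpose may or may not be needed in a concrete translation; the sketch keeps both transposes so that the general case is covered.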
I made some changes to dml2daph.py as well (mostly adding new functions), which I could push together with the other smaller algorithms I translated so far.
I have now reached a state where the code's functionality should be ready very soon, but I had to include many type conversions and small workarounds to make it work.
Some things I noticed were:
In expressions like x = y == 1 ? 1 : 2.0, having outputs of different types leads to errors.
After my last adjustment of the file in lines 188-190, which changed the values to be inserted from si64 to f64 for insert_col to work, I now have another error message of this type:
When using the --explain function, all passes up to llvm work fine, but the mlir_codegen pass doesn't seem to work. I have included a small test file in this PR that reproduces the error message.
After fixing the remaining errors, I will first test the functionality of the translated algorithm and then work on proper formatting and commenting.