cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

segmentation fault in simple code #123

Status: Closed (rohany closed this issue 3 years ago)

rohany commented 3 years ago

I have a simple code here:

#include <ctf.hpp>
#include <chrono>
#include <float.h>
using namespace CTF;

bool mttkrp(int n, World& dw) {
  int dimt[3] = {n, n, n};
  Tensor<double> B(3, false /* is_sparse */, dimt, dw);
  Matrix<double> A(n, n, dw), C(n, n, dw), D(n, n, dw);
  B.fill_random((double)0, (double)1);
  C.fill_random((double)0, (double)1);
  D.fill_random((double)0, (double)1);

  auto start = std::chrono::high_resolution_clock::now();
  A["il"] = B["ijk"] * C["jl"] * D["kl"];
  auto end = std::chrono::high_resolution_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  if (dw.rank == 0) {
    std::cout << "Execution time: " << ms << " ms." << std::endl;
  }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  {
    World dw;
    for (int i = 0; i < 10; i++) {
        mttkrp(512, dw);
    }
  }
  MPI_Finalize();
  return 0;
}

I compiled this by adding it to the examples and to the Makefile. After one iteration, it segfaults with this stack trace:

rohany@g0001:~/ctf$ OMP_NUM_THREADS=20 ./bin/mymttkrp
Execution time: 1771 ms.
[g0001:1784388] *** Process received signal ***
[g0001:1784388] Signal: Segmentation fault (11)
[g0001:1784388] Signal code: Address not mapped (1)
[g0001:1784388] Failing at address: 0x10000000e
[g0001:1784388] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fec0a4343c0]
[g0001:1784388] [ 1] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0xd3)[0x7fec0a453543]
[g0001:1784388] [ 2] ./bin/mymttkrp(+0x13c22)[0x5572e2710c22]
[g0001:1784388] [ 3] ./bin/mymttkrp(+0x171ab)[0x5572e27141ab]
[g0001:1784388] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fec0a2540b3]
[g0001:1784388] [ 5] ./bin/mymttkrp(+0x18dae)[0x5572e2715dae]
[g0001:1784388] *** End of error message ***
Segmentation fault

Have I done something wrong here, or is this a library bug?

rohany commented 3 years ago

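The problem was in my code, not the library: mttkrp is declared to return bool but never returns a value, which is undefined behavior in C++. Didn't think a missing return would cause this :(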
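
For anyone landing here with the same symptom, here is a minimal sketch of the corrected shape. It does not depend on CTF: `sum_below` is a hypothetical stand-in for the contraction and `timed_kernel` is a hypothetical name for the timed wrapper, so treat this as an illustration of the fix rather than the exact code from the report.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical stand-in for the tensor contraction: it only sums 0..n-1,
// so the example has something to time without depending on CTF.
long long sum_below(int n) {
  long long total = 0;
  for (int i = 0; i < n; i++) {
    total += i;
  }
  return total;
}

// Same shape as mttkrp above: declared to return bool, but here the
// function ends in an explicit return, so the caller reads a defined value.
bool timed_kernel(int n) {
  auto start = std::chrono::high_resolution_clock::now();
  long long result = sum_below(n);
  auto end = std::chrono::high_resolution_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "Execution time: " << ms << " ms (result " << result << ")." << std::endl;
  return true;  // the statement that was missing in the original function
}
```

Calling `timed_kernel(512)` in a loop from `main` mirrors the loop in the report. Alternatively, declaring the function `void` avoids the issue entirely when no result is needed. GCC and Clang can flag the original mistake at compile time: `-Wall` enables `-Wreturn-type`, and `-Werror=return-type` turns it into a hard error.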