cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

oom/memory corruption running an SDDMM (using TTTP specialized routine) #139

Closed rohany closed 2 years ago

rohany commented 2 years ago

The following program appears to run out of memory (it is killed by the job scheduler) when run with one process per core on a 40-node machine, and fails with a memory corruption error when run with 2 processes. Running with only 1 process raises a warning that the local size of a tensor is larger than INT_MAX. All runs use the arabic-2005 matrix.
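For context on the INT_MAX warning: with a single process, all of a tensor's elements are local to that process, so the local size equals the global size. The sketch below is not CTF code; the helper names are hypothetical, and the sizes passed to it (roughly 22.7 million rows for arabic-2005, and the `jdim` values) are assumptions for illustration, not numbers taken from this report. It only shows how quickly a dense `rows x jdim` matrix of doubles can exceed a 32-bit element count.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helpers for illustration; none of these are CTF functions.

// Element count of a dense rows x cols matrix, computed in 64 bits so the
// product does not overflow for realistic sizes.
std::int64_t dense_elems(std::int64_t rows, std::int64_t cols) {
  return rows * cols;
}

// True when the element count no longer fits in a 32-bit signed int.
bool exceeds_int_max(std::int64_t rows, std::int64_t cols) {
  return dense_elems(rows, cols) > static_cast<std::int64_t>(INT_MAX);
}

// Storage in bytes when every element is a double.
std::int64_t dense_bytes_double(std::int64_t rows, std::int64_t cols) {
  return dense_elems(rows, cols) * static_cast<std::int64_t>(sizeof(double));
}
```

Under these assumed sizes, `exceeds_int_max(22744080, 128)` is true while `exceeds_int_max(22744080, 32)` is false, so whether a dense factor matrix alone can trigger such a warning depends on the `jdim` that was used.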

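#include <ctf.hpp>
#include <string>
#include <vector>
using namespace CTF;

void sddmm(int nIter, int warmup, std::string filename, std::vector<int> dims, World& dw, int jdim) {
  // Sparse order-2 tensor that holds the input matrix.
  Tensor<double> B(2, true /* is_sparse */, dims.data(), dw);
  // Dense factor matrices, one per mode of B, each with jdim columns.
  Matrix<double> C(dims[0], jdim, dw);
  Matrix<double> D(dims[1], jdim, dw);
  // Both bounds are 1.0, so every entry of C and D is 1.0.
  C.fill_random(1.0, 1.0);
  D.fill_random(1.0, 1.0);

  // Load the nonzeros of B from the matrix file.
  B.read_sparse_from_file(filename.c_str());

  // Run TTTP on B with one factor matrix per mode (C for mode 0, D for mode 1).
  int modes[] = {0, 1};
  Tensor<double>* mats[] = {&C, &D};
  TTTP(&B, 2, modes, mats);
}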
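
As a point of comparison, here is a minimal sequential sketch of the SDDMM that the call above is meant to perform, assuming the usual definition B(i,j) <- B(i,j) * sum_k C(i,k) * D(j,k) evaluated only at the stored nonzeros of B. It does not use CTF. The `CooEntry` struct, the `sddmm_reference` name, and the row-major layout of `C` and `D` are assumptions made for this illustration, not part of the library or of the original program.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical coordinate-format entry: one stored nonzero of B.
struct CooEntry {
  int i;
  int j;
  double val;
};

// Reference SDDMM: scales each stored nonzero B(i,j) by the dot product of
// row i of C and row j of D. C and D are dense, row-major, jdim columns each.
void sddmm_reference(std::vector<CooEntry>& B,
                     const std::vector<double>& C,
                     const std::vector<double>& D,
                     int jdim) {
  for (CooEntry& e : B) {
    const std::size_t c_row = static_cast<std::size_t>(e.i) * jdim;
    const std::size_t d_row = static_cast<std::size_t>(e.j) * jdim;
    double dot = 0.0;
    for (int k = 0; k < jdim; ++k) {
      dot += C[c_row + k] * D[d_row + k];
    }
    e.val *= dot;
  }
}
```

With `C` and `D` filled entirely with 1.0, as in the program above, each dot product equals `jdim`, so every nonzero of B would simply be scaled by `jdim`. That makes a small input an easy correctness check.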
rohany commented 2 years ago

This turned out to be the same dimension + 1 problem discussed in https://github.com/cyclops-community/ctf/issues/138. Fixing that makes this work.