cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

Problem with push_slice #125

Open airmler opened 2 years ago

airmler commented 2 years ago

Dear Edgar,

The following code snippet leads to an MPI deadlock when running on 72 MPI ranks.

#include <ctf.hpp>
int main(int argc, char ** argv){

  MPI_Init(&argc, &argv);
  CTF::World dw;

  int64_t n(278);
  int64_t n2(2);
  int syms[] = {NS};

  int64_t m[] = {n};
  int64_t k[] = {n2};
  CTF::Tensor<double> e(1, m, syms, dw, "e");
  CTF::Tensor<double> f(1, k, syms, dw, "e");

  int epsiStart[] = { 0 }, epsiEnd[] = { 2 };

  f = e.slice(epsiStart, epsiEnd);

  MPI_Finalize();
  return 0;
}

The reason for the deadlock seems to be the following: the two tensors have different mappings.

The tensor e has the mapping e[p72(0)c0], whereas f reports topo (, 2, 2, 2, 3, 3); CTF: Tensor mapping is ETXS01[p2(0)c0]

This leads to wrong numbers in src/redistribution/slice.cxx:190, which in turn implies an incorrect MPI_Sendrecv_replace operation in src/redistribution/slice.cxx:207 (only ranks 2-71 reach the MPI_Sendrecv_replace call, with send/recv destination 0 or 1).
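
To illustrate the failure mode, here is a standalone sketch (not CTF's actual code): MPI_Sendrecv_replace blocks until the partner rank posts a matching call, so if ranks 2-71 all target ranks 0 or 1 while those two ranks never enter the exchange, the program hangs.

#include <mpi.h>
#include <vector>

int main(int argc, char ** argv){
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> buf(4, (double) rank);
  if (rank >= 2) {
    // every rank >= 2 tries to exchange data with rank 0 or 1 ...
    int partner = rank % 2;
    MPI_Sendrecv_replace(buf.data(), (int) buf.size(), MPI_DOUBLE,
                         partner, 0, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  // ... but ranks 0 and 1 never post matching calls, so all other ranks hang
  MPI_Finalize();
  return 0;
}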

The crudest workaround would be changing the if statement in src/tensor/untyped_tensor.cxx:1043. Is there another solution, or am I generally missing something here?

Best, Andreas

solomonik commented 2 years ago

@airmler I was able to reproduce this, thanks for reporting. Please try out the fix on branch fix_slice. The problem is that the new optimized slice code appears to be erroneous when a tensor is not distributed over all processors, as is the case with the small vector here. I simply turned off fast slice for such cases, which should be a robust fix. If the performance of slice becomes an issue as a result, please let me know; it should be possible to extend the optimized code to handle such cases as well, it will just require more work.
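
For illustration only, the dispatch logic described above might take the following shape; every name here (FakeTensor, uses_all_procs, fast_slice, general_slice) is a hypothetical stand-in and does not correspond to CTF's actual internals.

#include <cstdio>

struct FakeTensor {
  int nprocs_mapped;  // how many processors the tensor is distributed over
};

// hypothetical check: is the tensor mapped onto the full processor grid?
static bool uses_all_procs(FakeTensor const & t, int np) {
  return t.nprocs_mapped == np;
}

static void fast_slice(FakeTensor &, FakeTensor const &)    { std::puts("optimized slice path"); }
static void general_slice(FakeTensor &, FakeTensor const &) { std::puts("robust fallback path"); }

// take the optimized path only when both tensors use all processors
static void dispatch_slice(FakeTensor & dst, FakeTensor const & src, int np) {
  if (uses_all_procs(dst, np) && uses_all_procs(src, np))
    fast_slice(dst, src);
  else
    general_slice(dst, src);
}

int main() {
  FakeTensor e{72}, f{2};   // mirrors the mappings reported in the issue
  dispatch_slice(f, e, 72); // falls back, since f is mapped to only 2 ranks
  return 0;
}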

airmler commented 2 years ago

Thanks for the fast fix. This works perfectly for us. In our calculations the large tensors are fully distributed, which is why this change is satisfactory. I cannot see a case where the extension would really be necessary.