cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

segfault executing sparse inner product #138

Open rohany opened 2 years ago

rohany commented 2 years ago

The following code raises different segfaults depending on the process count (on a single node) when run on the nell-2 tensor.

void innerprod(int nIter, int warmup, std::string filename, std::string tensorC, std::vector<int> dims, World& dw) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Tensor<double> C(3, true /* is_sparse */, dims.data(), dw);
  Scalar<double> a(dw);

  B.read_sparse_from_file(filename.c_str());
  C.read_sparse_from_file(filename.c_str());

  a[""] = B["ijk"] * C["ijk"];
}

When run with a single process, it segfaults with the following backtrace:

/g/g15/yadav2/ctf/src/redistribution/sparse_rw.cxx:948 (discriminator 7)
/g/g15/yadav2/ctf/src/tensor/untyped_tensor.cxx:1302
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:609
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:940
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:952
/g/g15/yadav2/ctf/examples/spbench.cxx:199
/g/g15/yadav2/ctf/examples/spbench.cxx:317 (discriminator 7)

When run with 40 processes (1 process per core on my system), it segfaults with the following backtrace:

/g/g15/yadav2/ctf/src/contraction/contraction.cxx:119 (discriminator 3)
/g/g15/yadav2/ctf/src/interface/term.cxx:983
/g/g15/yadav2/ctf/src/interface/idx_tensor.cxx:227
/g/g15/yadav2/ctf/examples/../include/../src/interface/idx_tensor.h:262
/g/g15/yadav2/ctf/examples/spbench.cxx:209 (discriminator 6)
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:299
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:687
/g/g15/yadav2/ctf/examples/spbench.cxx:9 (discriminator 2)
/g/g15/yadav2/ctf/examples/spbench.cxx:208 (discriminator 1)
/g/g15/yadav2/ctf/examples/spbench.cxx:317 (discriminator 7)
??:0
??:0

Both of the "segfaults" appear to be internal assertion failures.

raghavendrak commented 2 years ago

The nell-2.tensor dimensions specified are 12092 x 9184 x 28818. I am assuming you are using the same values in dims.data(), but if you look at the indices in the tensor, there are entries with index 28818. Using 12093 x 9185 x 28819 will fix this.
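
For reference, one way to sanity-check the dims argument is to scan the file for the maximum index in each mode. A minimal standalone sketch, assuming the whitespace-separated .tns layout of one index per mode followed by the value:

// max_idx.cxx: report the maximum index in each of the first 3 columns of a
// .tns file (whitespace-separated: i j k value per line).
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
  const int order = 3;
  std::ifstream in(argv[1]);
  std::vector<long long> max_idx(order, 0);
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream ss(line);
    long long idx;
    for (int m = 0; m < order && (ss >> idx); m++)
      if (idx > max_idx[m]) max_idx[m] = idx;
  }
  for (int m = 0; m < order; m++)
    std::cout << "mode " << m << " max index: " << max_idx[m] << "\n";
  return 0;
}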

rohany commented 2 years ago

The .tns file format encodes all tensor indices as 1-indexed. Does the CTF read operation assume they are zero-indexed?

solomonik commented 2 years ago

Yes, and I think the documentation is consistent with that.
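
For illustration, if shifting the dims is undesirable, an alternative is to parse the file directly, subtract 1 from each coordinate, and write the pairs with Tensor::write. A sketch under two assumptions: that CTF's global index ordering is first-mode-fastest (idx = i + j*len0 + k*len0*len1), and that write() is collective, with rank 0 supplying all pairs and the other ranks none:

#include <ctf.hpp>
#include <fstream>
#include <vector>

// Hedged sketch: load a 1-indexed .tns file into a zero-indexed CTF tensor.
void read_tns_1indexed(const char* path, CTF::Tensor<double>& T, CTF::World& dw) {
  std::vector<int64_t> inds;
  std::vector<double> vals;
  if (dw.rank == 0) {
    std::ifstream in(path);
    int64_t i, j, k;
    double v;
    int64_t l0 = T.lens[0], l1 = T.lens[1];
    while (in >> i >> j >> k >> v) {
      // shift each coordinate down by 1, then linearize (assumed ordering)
      inds.push_back((i - 1) + (j - 1) * l0 + (k - 1) * l0 * l1);
      vals.push_back(v);
    }
  }
  // collective call: rank 0 contributes all pairs, the others contribute zero
  T.write((int64_t)inds.size(), inds.data(), vals.data());
}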

solomonik commented 2 years ago

One fix is to read the tensor with each dim larger by 1 and take a slice starting from index 1. I think we did that as a preprocessing step to get results elsewhere.
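
A sketch of that workaround, assuming Tensor::slice(offsets, ends) takes per-mode inclusive start and exclusive end arrays:

// Read with dims enlarged by 1 so the 1-indexed coordinates fit, then slice
// away index 0 in every mode to recover a tensor of the original dims.
std::vector<int> padded(dims);
for (auto& d : padded) d += 1;
CTF::Tensor<double> Bpad(3, true /* is_sparse */, padded.data(), dw);
Bpad.read_sparse_from_file(filename.c_str());

int offsets[3] = {1, 1, 1};
int ends[3]    = {padded[0], padded[1], padded[2]};
CTF::Tensor<double> B = Bpad.slice(offsets, ends);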

rohany commented 2 years ago

If CTF is reading the coordinates correctly, why is incrementing the dimensions necessary? Either way, I'll give it a try.

solomonik commented 2 years ago

.tns files are just one standard, so CTF's reader does not assume 1-indexing.

rohany commented 2 years ago

I tested this out, incrementing all of my dimensions by 1, and I'm still running into a segfault with both 1 and 40 processes.

raghavendrak commented 2 years ago

Are you seeing the segfault when reading the tensor? (I tried running your code, and we were able to read the tensors on 1 process).

rohany commented 2 years ago

No, it seems to be after the tensors load.

To replicate my exact setup, try running this code: https://github.com/rohany/ctf/blob/master/examples/spbench.cxx (and edit line 287 to be dims.push_back(atoi(it.c_str()) + 1);).

Then, run the binary with arguments: spbench -tensor <path to tns> -dims 12092,9184,28818 -n 20 -warmup 10 -bench spinnerprod -tensorC <path to tns>

raghavendrak commented 2 years ago

The segmentation fault occurs because CTF runs out of memory for the contraction. Can you try higher node counts? Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).
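
For reference, a sketch of the norm2 route when B == C (norm2 returns the Frobenius norm, so squaring it recovers the inner product):

// When B == C, the inner product B["ijk"]*C["ijk"] equals ||B||_F^2 and
// avoids mapping a full sparse-sparse contraction.
double nrm;
B.norm2(nrm);              // Frobenius norm of B
double inner = nrm * nrm;  // sum_ijk B_ijk^2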

rohany commented 2 years ago

I'm skeptical that memory usage is the problem (I usually get a signal 9 from the job scheduler when a process OOMs). I tried running with up to 8 nodes and saw segfaults each time.

Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).

I'm running with B != C.

raghavendrak commented 2 years ago

CTF calculates the memory usage a priori. If the contraction cannot be performed, an assert is triggered and the computation is aborted. Can you recompile and run your code with -DDEBUG=4 and -DVERBOSE=4? I was under the assumption that both B and C are loaded from the same file (filename.c_str()), based on the code in your first comment [the nell-2 tensor].

rohany commented 2 years ago

I don't see any interesting output with those flags on.

The output before the crash is:

CTF: Running with 4 threads
CTF: Total amount of memory available to process 0 is 170956357632
12093
9185
28819
debug:untyped_tensor.cxx:440 Created order 3 tensor ETXS03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 3 tensor AILI03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 0 tensor OBNO00, is_sparse = 0, allocated = 1
New tensor OBNO00 defined of size 1 elms (8 bytes):
printing lens of dense tensor OBNO00:
printing mapping of dense tensor OBNO00
CTF: OBNO00 mapped to order 4 topology with dims: 2  2  2  5
CTF: Tensor mapping is OBNO00[]
printing mapping of sparse tensor ETXS03
CTF: ETXS03 mapped to order 3 topology with dims: 10  2  2
CTF: Tensor mapping is ETXS03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.
printing mapping of sparse tensor AILI03
CTF: AILI03 mapped to order 3 topology with dims: 10  2  2
CTF: Tensor mapping is AILI03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.

and the backtrace is

/g/g15/yadav2/ctf/src/contraction/contraction.cxx:119 (discriminator 3)
/g/g15/yadav2/ctf/src/interface/term.cxx:983
/g/g15/yadav2/ctf/src/interface/idx_tensor.cxx:227
/g/g15/yadav2/ctf/examples/../include/../src/interface/idx_tensor.h:262
/g/g15/yadav2/ctf/examples/spbench.cxx:209 (discriminator 6)
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:299
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:687
/g/g15/yadav2/ctf/examples/spbench.cxx:9 (discriminator 2)
/g/g15/yadav2/ctf/examples/spbench.cxx:208 (discriminator 1)
/g/g15/yadav2/ctf/examples/spbench.cxx:323 (discriminator 7)
??:0
??:0

I was under the assumption that both B and C are loaded from the same file (filename.c_str()), based on the code in your first comment [the nell-2 tensor].

That was a typo. The load of C should have used a different input filename.
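
For clarity, the intended repro, with C loaded from the tensorC argument instead:

void innerprod(int nIter, int warmup, std::string filename, std::string tensorC, std::vector<int> dims, World& dw) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Tensor<double> C(3, true /* is_sparse */, dims.data(), dw);
  Scalar<double> a(dw);

  B.read_sparse_from_file(filename.c_str());
  C.read_sparse_from_file(tensorC.c_str()); // was filename.c_str() in the original post

  a[""] = B["ijk"] * C["ijk"];
}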

raghavendrak commented 2 years ago

So if I want to reproduce this, which two tensor files do I need to use? (I see that both tensors ETXS03 and AILI03 have the same number of non-zero entries: 76879419.)

rohany commented 2 years ago

I'm currently running it with the same tensor file for both (nell-2 and nell-2), but I aim to use different tensor files once we resolve the segfault.

raghavendrak commented 2 years ago

CTF runs out of memory for this contraction (with the nell-2 tensor as input for both B and C). I tried up to 128 nodes with no luck. There is also a possibility of a bug in CTF. With -DDEBUG=4 and -DVERBOSE=4 you should be able to see output similar to the following:

debug:contraction.cxx:2942 [EXH] Not enough memory available for topo 2047 with order 1 memory 1778101471/1183301216
ERROR: Failed to map contraction!

rohany commented 2 years ago

Is this related to the shape of the tensor, or will tensors of similar and greater size also fail? Specifically, the other larger tensors in the FROSTT suite?

raghavendrak commented 2 years ago

My guess is that it has to do with the size and the contraction type. We might have to try other tensors with this contraction to be able to conclude.