rohany opened this issue 2 years ago
The nell-2 tensor's dimensions are specified as 12092 x 9184 x 28818. I am assuming you are using the same dimensions in dims.data(), but if you look at the indices specified in the tensor, there are values with index 28818. Using 12093 x 9185 x 28819 will fix this.
The .tns file format encodes all tensor indices as 1-indexed. Does the CTF read operation assume they are zero-indexed?
Yes, I think the documentation is consistent with that.
One fix is to just read a tensor with dims larger by 1 and take a slice starting from index 1; I think we did that as a preprocessing step to get results elsewhere.
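A minimal sketch of that workaround, assuming CTF's Tensor(order, is_sparse, lens, world) constructor and Tensor::slice(offsets, ends) with exclusive ends (the fill step is elided; load_tns_padded is a hypothetical helper):

```cpp
#include <ctf.hpp>
#include <vector>

// Read a .tns tensor into a CTF tensor padded by 1 in every mode, so the
// 1-indexed coordinates fit, then slice off the unused 0th index.
CTF::Tensor<double> load_tns_padded(std::vector<int> dims, CTF::World & dw) {
  std::vector<int> padded(dims);
  for (auto & d : padded) d += 1;                // dims larger by 1
  CTF::Tensor<double> T((int)padded.size(), /*is_sparse=*/true,
                        padded.data(), dw);
  // ... populate T here, e.g. via T.write(nnz, global_idx, vals),
  //     using the raw 1-indexed .tns coordinates ...
  std::vector<int> offs(padded.size(), 1);       // slice starting from 1
  return T.slice(offs.data(), padded.data());    // ends assumed exclusive
}
```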
If CTF should be reading in the coordinates correctly, why is incrementing the dimensions necessary? Either way, I'll give it a try.
.tns files are just one standard.
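For what it's worth, a minimal standalone sketch (no CTF-specific APIs) of the alternative: shift the 1-indexed .tns coordinates to 0-indexed while reading, before handing them to CTF:

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Reads "i j k ... val" lines from a .tns file; subtracts 1 from each
// coordinate so the result follows the 0-indexed convention.
void read_tns_zero_indexed(const char * path, int order,
                           std::vector<int64_t> & coords,
                           std::vector<double> & vals) {
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;  // skip comment lines
    std::istringstream ss(line);
    for (int m = 0; m < order; m++) {
      int64_t c; ss >> c;
      coords.push_back(c - 1);  // 1-indexed -> 0-indexed
    }
    double v; ss >> v;
    vals.push_back(v);
  }
}
```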
I tested this out by incrementing all of my dimensions by 1, and I'm still running into a segfault on both 1 and 40 processes.
Are you seeing the segfault when reading the tensor? (I tried running your code, and we were able to read the tensors on 1 process).
No, it seems to be after the tensors load.
To replicate my exact setup, try running this code: https://github.com/rohany/ctf/blob/master/examples/spbench.cxx (and edit line 287 to be dims.push_back(atoi(it.c_str()) + 1);).
Then, run the binary with arguments:
spbench -tensor <path to tns> -dims 12092,9184,28818 -n 20 -warmup 10 -bench spinnerprod -tensorC <path to tns>
The segmentation fault is because CTF runs out of memory for the contraction. Can you try higher node counts?
Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).
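A hedged sketch of that shortcut, assuming CTF's Tensor::norm2(double&) and Scalar interfaces: since <B,B> = ||B||_F^2, norm2 avoids materializing the full contraction when both operands are the same tensor.

```cpp
#include <ctf.hpp>

// Inner product of two order-3 sparse tensors; shortcut when B and C
// are the same object.
double inner_or_norm(CTF::Tensor<double> & B, CTF::Tensor<double> & C,
                     CTF::World & dw) {
  if (&B == &C) {
    double nrm;
    B.norm2(nrm);            // Frobenius norm ||B||_F
    return nrm * nrm;        // <B,B> = ||B||_F^2
  }
  CTF::Scalar<double> s(dw);
  s[""] = B["ijk"] * C["ijk"];  // full inner product otherwise
  return s.get_val();
}
```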
I'm skeptical that memory usage is the problem (I usually get a signal 9 from the job scheduler when a process OOMs). I tried running with up to 8 nodes and saw segfaults each time.
> Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).
I'm running with B != C.
CTF calculates the memory usage a priori. If the contraction cannot be performed, then an assert is triggered and the computation is aborted. Can you recompile and run your code with -DDEBUG=4 and -DVERBOSE=4?
I was under the assumption that both B and C are loaded with the same tensor (filename.c_str()) (based on your code mentioned first here) [nell-2 tensor].
I don't see any interesting output with those flags on.
The output before the crash is:
CTF: Running with 4 threads
CTF: Total amount of memory available to process 0 is 170956357632
12093
9185
28819
debug:untyped_tensor.cxx:440 Created order 3 tensor ETXS03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 3 tensor AILI03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 0 tensor OBNO00, is_sparse = 0, allocated = 1
New tensor OBNO00 defined of size 1 elms (8 bytes):
printing lens of dense tensor OBNO00:
printing mapping of dense tensor OBNO00
CTF: OBNO00 mapped to order 4 topology with dims: 2 2 2 5
CTF: Tensor mapping is OBNO00[]
printing mapping of sparse tensor ETXS03
CTF: ETXS03 mapped to order 3 topology with dims: 10 2 2
CTF: Tensor mapping is ETXS03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.
printing mapping of sparse tensor AILI03
CTF: AILI03 mapped to order 3 topology with dims: 10 2 2
CTF: Tensor mapping is AILI03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.
and the backtrace is:
/g/g15/yadav2/ctf/src/contraction/contraction.cxx:119 (discriminator 3)
/g/g15/yadav2/ctf/src/interface/term.cxx:983
/g/g15/yadav2/ctf/src/interface/idx_tensor.cxx:227
/g/g15/yadav2/ctf/examples/../include/../src/interface/idx_tensor.h:262
/g/g15/yadav2/ctf/examples/spbench.cxx:209 (discriminator 6)
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:299
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:687
/g/g15/yadav2/ctf/examples/spbench.cxx:9 (discriminator 2)
/g/g15/yadav2/ctf/examples/spbench.cxx:208 (discriminator 1)
/g/g15/yadav2/ctf/examples/spbench.cxx:323 (discriminator 7)
??:0
??:0
> I was under the assumption that both B and C are loaded with the same tensor (filename.c_str()) (based on your code mentioned first here) [nell-2 tensor].
That was a typo. The load of C should have used a different input filename.
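In other words, a hypothetical sketch of the corrected loads (the filename variables are placeholders, and read_sparse_from_file is assumed to be the reader spbench uses):

```cpp
// B and C should come from two distinct input files:
CTF::Tensor<double> B(3, /*is_sparse=*/true, dims.data(), dw);
CTF::Tensor<double> C(3, /*is_sparse=*/true, dims.data(), dw);
B.read_sparse_from_file(filenameB.c_str());  // e.g. the -tensor argument
C.read_sparse_from_file(filenameC.c_str());  // e.g. the -tensorC argument
```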
So if I have to reproduce this, what are the two tensor files I need to use? (I see that both tensors ETXS03 and AILI03 have the same number of non-zero entries: 76879419?)
I'm currently running it with the same tensor files (nell-2 and nell-2), but I aim to use it for different tensor files once we can resolve the segfault.
CTF runs out of memory for this contraction (with the nell-2 tensor as input for both B and C). I tried up to 128 nodes with no luck. There is also a possibility of a bug in CTF. With -DDEBUG=4 and -DVERBOSE=4 you should be able to see output similar to the following:
debug:contraction.cxx:2942 [EXH] Not enough memory available for topo 2047 with order 1 memory 1778101471/1183301216
ERROR: Failed to map contraction!
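(Illustrative only: the failed-mapping message reflects an a-priori feasibility check of roughly this shape. Reading the 1778101471/1183301216 pair as required/available bytes is an assumption; this is not CTF's actual code.)

```cpp
#include <cstdint>

// Conceptual sketch: reject any candidate topology mapping whose
// estimated memory footprint exceeds what the process has available.
bool mapping_fits(int64_t required_bytes, int64_t available_bytes) {
  // In the log above: 1778101471 > 1183301216, so the mapping is rejected.
  return required_bytes <= available_bytes;
}
```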
Is this related to the shape of the tensor, or will tensors of similar and greater size also fail? Specifically, the other larger tensors in the FROSTT suite?
My guess is that it has to do with the size and the contraction type. We might have to try other tensors with this contraction to be able to conclude.
The following code raises different segfaults depending on the process count (on a single node) when run on the nell-2 tensor.
When run with a single process, it segfaults with the following backtrace:
When run with 40 processes (1 process per core on my system), it segfaults with the following backtrace:
Both of the "segfaults" seem to be internal assertion failures.