cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays

How do I fill a Cyclops tensor from local data with a different distribution? #31

Closed kannon92 closed 7 years ago

kannon92 commented 7 years ago

Hello everyone,

I wasn't sure if I should email someone about this problem, so I decided to post an issue.

I have been working on writing a parallel density-fitting Fock builder using this library. I had a previous version working with Global Arrays (GA), but I kept running into problems with memory allocation, and other parts of the code behaved oddly. I decided to try Cyclops to see if it fixes these problems. I also really like the sparsity handling, so I have been trying out that feature. My code works in serial (with and without sparsity), but I am having trouble figuring out how to fill a parallel tensor object.

Anyway, I have an array that is distributed along the slowest index, (mn|Q), where the Q rows are statically distributed over N processors. I was able to do this quite simply with GA, since you can specify the distribution that a tensor is defined on.

In Cyclops, I just want to fill the tensor using the local data available on each processor (in my array), which does not seem to correspond to the distribution that Cyclops uses internally. I realize that the distribution influences performance for contractions and parallel linear algebra calls. I tried using write(int64_t npair, int64_t *global_idx, double *values), passing each processor's local data as the elements to write, but this throws a segmentation fault when the pairs are allocated (https://github.com/solomonik/ctf/blob/master/src/interface/tensor.cxx#L260).

Does write have to use the same distribution as read_local? Is there another way to write to the tensor, given that my local data is not in the same distribution as the tensor?
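
(For reference while reading the replies below: CTF's write() takes global keys in which the first index varies fastest, so element (m, n, q) of an nm-by-nn-by-nq tensor has key m + n*nm + q*nm*nn. A minimal sketch of filling a tensor from slabs blocked along the slowest index Q; all names here are illustrative, not taken from the poster's code:)

```cpp
#include <ctf.hpp>
#include <vector>

// Each rank owns a contiguous slab q = q_lo .. q_hi-1 of a 3-d tensor
// (nm x nn x nq), stored locally as loc[m + n*nm + (q - q_lo)*nm*nn].
void write_slab(CTF::Tensor<double> & T, double const * loc,
                int nm, int nn, int64_t q_lo, int64_t q_hi){
  int64_t npair = (q_hi - q_lo) * (int64_t)nm * nn;
  std::vector<int64_t> keys(npair);
  std::vector<double>  vals(npair);
  int64_t i = 0;
  for (int64_t q = q_lo; q < q_hi; q++)
    for (int64_t n = 0; n < nn; n++)
      for (int64_t m = 0; m < nm; m++, i++){
        keys[i] = m + n*(int64_t)nm + q*(int64_t)nm*nn; // global key
        vals[i] = loc[i];
      }
  // collective call: each rank passes only the elements it owns, and
  // CTF routes them into whatever internal distribution the tensor uses
  T.write(npair, keys.data(), vals.data());
}
```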

kannon92 commented 7 years ago

So with a gdb backtrace, there is a segmentation fault when posix_memalign is called in memcontrol.cxx:328.

I tried to find an answer to this online; the usual cause is requesting a chunk of memory larger than the max value of int, but that is not the case here, and the return value from posix_memalign is 32767. The buffer being allocated has length 760,000, even though I only passed a size of 47500 to the write.

solomonik commented 7 years ago

Hi Kevin,

The write() call is the right function for the purposes you describe. If posix_memalign() or the assert on the subsequent line fails, my first guess would be that you are running out of memory. If you write 47500 doubles, pairs of (int64_t key, double val) will be allocated with a total size of 760,000 bytes. Does the code work with a smaller problem size? You mentioned the sequential version works; does that also use the write() function?

The write(), read(), read_local() and the contract functionality in CTF may need to use extra buffers (up to 2-3x the memory of the tensor), so running right below full memory utilization may be problematic (if you have a cluster, the solution is to just use more nodes). You can get rid of one of the buffer allocations, the one that is crashing for you, by using the write() function which takes Pair<> objects (array of structs rather than struct of arrays), in which case no conversion occurs, as this representation is the one used internally. But it may still happen that CTF runs into memory overflow elsewhere, if a factor of two in buffer space is not available.
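
(A sketch of the Pair<> variant suggested above, which hands CTF the array-of-structs representation directly and so skips the conversion buffer that was crashing; the Pair<> fields k and d hold the key and value, and the loop is the same illustrative slab fill as before:)

```cpp
#include <ctf.hpp>
#include <vector>

// Same illustrative slab fill, but packing (key, value) structs so that
// no separate conversion buffer needs to be allocated by CTF.
void write_slab_pairs(CTF::Tensor<double> & T, double const * loc,
                      int nm, int nn, int64_t q_lo, int64_t q_hi){
  int64_t npair = (q_hi - q_lo) * (int64_t)nm * nn;
  std::vector< CTF::Pair<double> > pairs(npair);
  int64_t i = 0;
  for (int64_t q = q_lo; q < q_hi; q++)
    for (int64_t n = 0; n < nn; n++)
      for (int64_t m = 0; m < nm; m++, i++){
        pairs[i].k = m + n*(int64_t)nm + q*(int64_t)nm*nn; // global key
        pairs[i].d = loc[i];                               // value
      }
  T.write(npair, pairs.data()); // Pair<> overload, no conversion
}
```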

If the problem is not memory, I can try to reproduce it if you post/send the code. Seems like constructing a small example of this problem should also be viable.

By the way, it is possible to ask CTF to distribute the tensor in a predefined manner, see https://github.com/solomonik/ctf/blob/master/src/interface/tensor.h#L285. It's necessary to define a Partition object which corresponds to the processor grid, and to assign indices to define the mapping from tensor indices to processor grid dimensions. Then you can access the local buffer directly as raw_data, and it will match the distribution you request. However, the distribution will be cyclic and will be padded, so if you are just blocking along the Q index, it will be a different ordering. If you always map the Q index in the same way, the blocked-to-cyclic permutation should not matter, but dealing with this may be complex. So, I would not recommend overriding the default mapping unless read()/write() is observed to be a significant performance bottleneck.

kannon92 commented 7 years ago

I have not gotten any test to work in parallel.

I am able to run the sequential code using write. This is not a large problem, so I should not be anywhere close to running out of memory. I am running on a cluster with 128 GB of memory, and this test case is a double tensor of size 115 by 25 by 25, so only about 0.6 MB.

I will try to make a mock code that reproduces this problem. I will also try the Pair<> version of write. I will post the program here in a few hours.

solomonik commented 7 years ago

I see. I can also recommend trying valgrind; maybe there is memory corruption somewhere earlier. I also find valgrind much easier than most debuggers to use in combination with MPI.

The write() function with separate index and value arrays should work correctly, I doubt switching to Pairs will make a difference.

solomonik commented 7 years ago

If it's easy, feel free to post the relevant code snippet rather than extracting a full working sample; that might be sufficient.

kannon92 commented 7 years ago

So I think this is my problem.

I am able to get past the write call without any problems now. I messed up the allocation on different processors. I'm still working on getting my code working, but it is now just giving me a wrong answer, so that should be an easier fix. Thank you for the clarification on how to use Cyclops.

kannon92 commented 7 years ago

Thank you for the help. Everything works great now.

I have an off-topic question:

How does the performance of your tensor library do when there is a large amount of sparsity? I notice that your paper tested it for 16% and lower. I am interested in using this library for implementing something like what is done in this paper: "Journal of Chemical Theory and Computation 2016 12 (7), 3122-3134".

I have the sparse code working and I was going to try different ways of obtaining sparsity. Have you noticed any performance hits if there is a large number of zeros compared to non-zeros?

solomonik commented 7 years ago

You can find some newer results for sparse MP3 in my ISTCP presentation (slides 14, 15) http://solomon2.web.engr.illinois.edu/talks/istcp_jul22_2016.pdf

There we test 10%, 1%, and 0.1% sparsity. There is a diminishing return from sparsity, but it depends on the particular problem. Above 10% sparsity, the sparse code can actually be slower than the dense case. In part, this is just due to how fast the MKL routines we are using are (by the way, make sure you configure with Intel MKL if you want decent sparse performance).

Also, I should warn that CTF currently can't handle Hadamard/weigh indices with sparse tensors (dense works); sparse contractions have to be `proper' (reducible to matmul). That limitation is described in this GitHub issue: https://github.com/solomonik/ctf/issues/23. It's already the highest-priority item for CTF, but let me know if it's also necessary for what you are doing (it may be necessary for sparse HF), and I will really try to hurry up on implementing that functionality.

kannon92 commented 7 years ago

So I'm kinda confused about the Hadamard indices.

What kinds of tensor contractions work with sparsity?

My group (the Evangelista lab) is interested in this library, and I know that we already have Hadamard indices in our code. We don't have to use the sparsity feature, but it will be of use to us at some point in the near future.

solomonik commented 7 years ago

Hi Kevin,

Contractions like C["...i..."]=A["...i..."]*B["...i..."] currently don't work if A, B, or C is sparse. If they are all dense, it works. If every index appears in only two tensors, all combinations of sparsity are supported.

Summations do allow B["ij"]=A["ij"] with sparse A and B, though. It's also possible to use custom types to implement C["ij"]=A["ij"]*B["ij"], by forming a tensor P and effectively computing P["ij"]=(A["ij"],B["ij"]) (see the Function<> syntax), then computing C from P.

But if you need something like C["ikl"]=A["ijk"]*B["ijl"], there is no good solution at the moment for sparse A, B. But again, this is high priority and I am hoping to add the functionality in the next couple of months. Let me know if you need it urgently.
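
(To make the supported and unsupported patterns above concrete, a small sketch; tensor sizes are illustrative, and the boolean sparsity flag in the Tensor constructor is assumed from CTF's interface:)

```cpp
#include <ctf.hpp>
using namespace CTF;

int main(int argc, char ** argv){
  MPI_Init(&argc, &argv);
  {
    World w;
    int n = 40;
    int l3[3] = {n, n, n}, s3[3] = {NS, NS, NS};
    int l2[2] = {n, n},    s2[2] = {NS, NS};
    Tensor<double> A(3, true, l3, s3, w);  // sparse, order 3
    Tensor<double> B(2, true, l2, s2, w);  // sparse, order 2
    Tensor<double> B3(3, true, l3, s3, w); // sparse, order 3
    Tensor<double> C(3, true, l3, s3, w);  // sparse, order 3
    Tensor<double> D(2, true, l2, s2, w);  // sparse, order 2

    // supported: every index appears in exactly two tensors,
    // so the contraction is `proper' (reducible to matmul)
    C["ikl"] = A["ijk"] * B["jl"];

    // supported: a summation involves only two tensors
    D["ij"] = B["ij"];

    // NOT supported with sparse operands: i appears in all three
    // tensors (a Hadamard/weigh index)
    // C["ikl"] = A["ijk"] * B3["ijl"];
  }
  MPI_Finalize();
  return 0;
}
```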

kannon92 commented 7 years ago

So I have a working parallel sparse Fock builder. However, I do see some performance problems when calling read and write. I am trying to figure out how to change the distribution of my tensor.

I am confused about how to set up the Partition object to achieve this.

Let's say I have a 3-dimensional tensor (edge lengths m, n, n), and I want to distribute along the first dimension. I tried doing Partition part(3, {m, n, n}).
Now I have to get an Idx_Partition. What do I do from here? If I want to distribute along m only, do I just fill the idx member of Idx_Partition with the blocking I want (i.e., P0 gets the 0th row, P1 gets the 1st row, P2 gets the 2nd row)? What should I do for the other dimensions?

Sorry for these simple questions. I have a hard time in understanding how the Partition object is used for creating distributions.

solomonik commented 7 years ago

One working example to define a matrix in a specific distribution (you can add to/from it with one in a different distribution to move data quicker than read/write):

```cpp
int plens[] = {pr, pc};
Partition ip(2, plens);
Matrix M(nrow, ncol, "ij", ip["ij"], Idx_Partition(), 0, this->wrld, this->sr);
```

So, you'll want pr=p, pc=1. Then you can access the data via the raw_data() pointer, and it will be distributed cyclically over the mode for which you specified a mapping over p processors. Cyclic over p processors means processor j owns elements {j, j+p, j+2p, ... n-p+j}.
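
(Applied to the 3-d (m, n, n) case asked about above, a minimal sketch; the argument order of the partitioned Tensor constructor follows the tensor.h link earlier in the thread, and get_raw_data() is assumed to be the accessor referred to as raw_data, so check both against your CTF version:)

```cpp
#include <ctf.hpp>
using namespace CTF;

// Distribute a 3-d (m x n x n) tensor over the first mode only, using a
// 1-d processor grid of p processors mapped to tensor index 'i'.
void partitioned_tensor(World & w, int m, int n){
  int p = w.np;                 // number of MPI processes in this world
  int plens[1] = {p};
  Partition part(1, plens);     // 1-d processor grid
  int lens[3] = {m, n, n};
  int syms[3] = {NS, NS, NS};
  // map index 'i' to the grid dimension; 'j' and 'k' stay unpartitioned
  Tensor<double> T(3, lens, syms, w, "ijk", part["i"], Idx_Partition());
  // processor j owns the cyclic slice i = j, j+p, j+2p, ..., with the
  // first mode padded up to a multiple of p
  int64_t sz;
  double * loc = T.get_raw_data(&sz);
  for (int64_t e = 0; e < sz; e++) loc[e] = 0.0; // e.g. fill in place
}
```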

Hope that clarifies things; let me know if you have more questions. Sorry there are no examples for this at the moment, as it is relatively new functionality. It has also not been tested that thoroughly, so please let me know if you see issues.

But also, depending on what you need to access the tensor for so frequently, it could be even more efficient to apply a Function<>(), i.e. encode the access in a summation/contraction with a special elementwise operator. If this seems viable but unclear, give us more details about what you need to do.

kannon92 commented 7 years ago

Thank you for your quick response. It is no problem that there aren't any examples. I will play around with this tomorrow and I will let you know if I find any problems.