cyclops-community / ctf

Cyclops Tensor Framework: parallel arithmetic on multidimensional arrays
Other
199 stars 53 forks source link

Slicing: compare number of processors correctly #76

Open alejandrogallo opened 4 years ago

alejandrogallo commented 4 years ago

Hello,

we updated our version of CTF and were having some issues with regard to the performance of the slicing in our code.

After some sniffing around we found this line in the slice method. In version 1.4.1 the if statement had the < operator in it. Some time after that this was changed into <= which according to my understanding renders the else codeblock useless, i.e.

...
} else {
      tsr_A->read_local_nnz(&sz_A, &all_data_A);
//      printf("sz_A+%ld\n",sz_A);
}
...

This means that when the number of processors where tsr_B is distributed among is equal to tsr_A, there is also a checking of the dimensions and padding for A, and this means that CTF has to read the data from A,

...
      tsr_A->write(blk_sz_B, sr->mulid(), sr->addid(), blk_data_B, 'r');
...

which makes this block slower.

If this is true, this pull request would be a fix to the problem. We have certainly tested it and it confirmed our suspicion. For slices of big tensors the difference between the < and the <= version is up to 50% in time, according to our benchmarks. However, for small tensors it appears to be roughly equivalent, which makes sense.

Thank you very much for your great project!

solomonik commented 4 years ago

Thanks a lot Alejandro!

These changes seem to make the python tests fail though, likely due to some subtleties involving sparsity, so I will need to investigate/adjust a bit before merging.