IntelLabs / SpMP

sparse matrix pre-processing library
https://github.com/IntelLabs/SpMP/wiki

SpMV BW changes from run to run #3

Open yubai0827 opened 4 years ago

yubai0827 commented 4 years ago

The following is the result from 7 runs:

========== Run #1 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 14.72 gflops 126.22 gbps
MKL SpMV BW 5.22 gflops 44.76 gbps
MKL inspector-executor SpMV BW 15.73 gflops 134.91 gbps
========== Run #2 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 9.83 gflops 84.28 gbps
MKL SpMV BW 5.77 gflops 49.50 gbps
MKL inspector-executor SpMV BW 15.26 gflops 130.88 gbps
========== Run #3 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 10.12 gflops 86.76 gbps
MKL SpMV BW 5.24 gflops 44.95 gbps
MKL inspector-executor SpMV BW 15.33 gflops 131.50 gbps
========== Run #4 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.35 gflops 165.99 gbps
MKL SpMV BW 4.83 gflops 41.45 gbps
MKL inspector-executor SpMV BW 13.83 gflops 118.65 gbps
========== Run #5 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.23 gflops 164.88 gbps
MKL SpMV BW 5.47 gflops 46.89 gbps
MKL inspector-executor SpMV BW 14.64 gflops 125.59 gbps
========== Run #6 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 10.02 gflops 85.93 gbps
MKL SpMV BW 3.66 gflops 31.35 gbps
MKL inspector-executor SpMV BW 10.10 gflops 86.60 gbps
========== Run #7 ============
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 19.28 gflops 165.37 gbps
MKL SpMV BW 5.83 gflops 49.97 gbps
MKL inspector-executor SpMV BW 15.30 gflops 131.19 gbps

This is running: test/reordering_test

The matrix is webbase-1M: https://sparse.tamu.edu/Williams/webbase-1M

Is it normal for the bandwidth to change this dramatically from run to run? Also, sometimes the MKL BW is higher and sometimes it is lower. Is this expected? If not, am I missing something?

Many thanks, Yu Bai

jspark1105 commented 4 years ago

Can you tell me the specification of the machine and the compiler you used? Did you specify thread affinity as described in the comments of the tests? Single-thread performance should be more stable. Since Skylake and Cascade Lake have more cache capacity with a non-inclusive LLC, you may want to increase LLC_CAPACITY (defined in SpMP/test.hpp) and consider using the clflush instruction instead of just updating a large array (see https://github.com/pytorch/FBGEMM/blob/master/bench/BenchUtils.h#L42 as an example).
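The large-array approach mentioned above can be sketched as follows. This is an illustrative stand-in, not SpMP's actual code; the names `flush_llc` and `LLC_CAPACITY` and the exact sizes are assumptions for the sketch:

```cpp
#include <cstddef>
#include <vector>

// Sketch: defeat cache reuse between benchmark iterations by streaming
// through a buffer at least as large as the last-level cache, so the
// matrix/vector data is evicted before the next timed run. On CPUs with a
// non-inclusive LLC (Skylake/Cascade Lake) the buffer may need to be larger
// than the nominal LLC size; flushing the actual data with clflush per cache
// line is the more precise alternative discussed above.
static const std::size_t LLC_CAPACITY = 32 * 1024 * 1024; // bytes; try 4x this

double flush_llc(std::vector<char>& buf) {
    if (buf.size() < LLC_CAPACITY) buf.resize(LLC_CAPACITY);
    double sum = 0.0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        buf[i] += 1;   // the write pulls each line into cache, evicting others
        sum += buf[i];
    }
    return sum;        // returned so the compiler cannot drop the loop
}
```

The drawback of this approach is that it only probabilistically evicts the working set, which is one reason run-to-run variance can remain.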

Just curious: why is the SpMV performance after reordering not printed?

yubai0827 commented 4 years ago

First, thank you!

machine.log (the output of "cat /proc/cpuinfo")

make.log (the make log)

No, I don't explicitly set the thread count. The command is "reordering_test matrix".

Do you mean that too small an LLC capacity might introduce performance variance from run to run? The current definition is: static const size_t LLC_CAPACITY = 32*1024*1024; How much larger would be appropriate?

I commented out the remaining code, so nothing is printed after reordering.

"A->multiplyWithVector(y, x);" is where matrix-vector multiplication happens, isn't it?

Thanks again.

jspark1105 commented 4 years ago

I'd start with single-thread runs using OMP_NUM_THREADS=1, and please also specify thread affinity to make results more consistent. Then you can increase the number of threads, but I'd stay within a single socket. Please try 4x the current LLC_CAPACITY, but I think the number of threads and the affinity have the biggest impact.
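Concretely, the 4x suggestion would look like the following edit in SpMP/test.hpp (the surrounding code is not shown; 128 MiB is simply 4x the 32 MiB currently defined there):

```cpp
// SpMP/test.hpp: 4x the original 32 MiB, per the suggestion above
static const size_t LLC_CAPACITY = 128 * 1024 * 1024;
```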

yubai0827 commented 4 years ago

Thank you. How do I specify thread affinity? I noticed this in the comments:

OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1 test/reordering_test web-Google.mtx

yubai0827 commented 4 years ago

Can you please confirm that the matrix-vector multiplication actually happens in "A->multiplyWithVector(y, x);"? If we only want to study the matrix-vector multiplication, can we safely comment out the rest of the reordering_test code? Thank you!

jspark1105 commented 4 years ago

> Thank you. How to specify thread affinity? I notice in the comments:
>
> OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1 test/reordering_test web-Google.mtx

KMP_AFFINITY=granularity=fine,compact,1 is a reasonable setup to use (one thread per physical core). For a single-thread run, use OMP_NUM_THREADS=1; then you can increase up to the number of physical cores in a socket. For more details, please Google KMP_AFFINITY, and I'm sure experts inside Intel know much more about this than me :)

Yes, that's where SpMV actually happens. But later in the code, SpMV is executed again after reordering, which can give better performance depending on the sparsity pattern (MKL inspector-executor can also reorder, so for a fair comparison you'd want to compare against the performance after reordering).
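For reference, a minimal sketch of the computation a call like multiplyWithVector performs (y = A*x for a CSR matrix). This is not SpMP's actual implementation, which additionally parallelizes and load-balances the row loop:

```cpp
#include <vector>

// Sketch of y = A*x for an m-row CSR matrix: rowptr (size m+1) gives each
// row's nonzero range, colidx/val (size nnz) give column indices and values.
void csr_spmv(int m,
              const std::vector<int>& rowptr,
              const std::vector<int>& colidx,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            sum += val[j] * x[colidx[j]];  // gather from x by column index
        y[i] = sum;
    }
}
```

The irregular gathers `x[colidx[j]]` are why reordering to reduce bandwidth helps: it improves the locality of accesses to x. The gflops numbers in the logs count 2 flops (multiply + add) per nonzero.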

yubai0827 commented 4 years ago

Many thanks. I tried running after reordering and get a seg fault with OMP_NUM_THREADS=1, as follows:

OMP_NUM_THREADS=1 KMP_AFFINITY=granularity=fine,compact,1 reordering_test webbase-1M.bin
m = 1000005 nnz = 3105536 3.105520 bytes = 53266512.000000
original bandwidth 987649
SpMV BW 0.92 gflops 7.86 gbps
MKL SpMV BW 0.86 gflops 7.42 gbps
MKL inspector-executor SpMV BW 1.24 gflops 10.63 gbps

BFS reordering
Constructing permutation takes 0.038738 (0.32 gbps)
0 missed 0 duplicated
Permute takes 0.021677 (2.46 gbps)
Permuted bandwidth 964586
SpMV BW 1.87 gflops 16.03 gbps

RCM reordering w/o source selection heuristic Segmentation fault

BFS reordering does help improve performance: 1.24 gflops to 1.87 gflops. RCM reordering causes a seg fault.

jspark1105 commented 4 years ago

Ignoring the segfault, are you seeing more consistent performance across runs with a single thread?

I think the segfault is because BFS/RCM reordering only works for symmetric matrices. If you build with the DBG=yes option, you will see assertions like the following (sorry about the poor error handling):

reordering_test: reordering/RCM.cpp:976: void SpMP::CSR::getBFSPermutation(int*, int*): Assertion `isSymmetric(false)' failed.

Actually, reordering_test tries to make the input matrix symmetric, as you can see from https://github.com/IntelLabs/SpMP/blob/master/test/reordering_test.cpp#L88, but apparently forceSymmetric is ignored when loading from a .bin file. For reordering_test, please use the .mtx file.
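The symmetry requirement behind that assertion can be sketched as below. This is a simplified illustration, not SpMP's actual CSR::isSymmetric (which, judging from the assertion's isSymmetric(false), takes options such as whether to compare values):

```cpp
#include <vector>

// Sketch: a CSR matrix is symmetric if for every stored entry (i, j, v)
// there is a matching entry (j, i, v). Naive O(nnz * row degree) scan.
bool is_symmetric(int m, const std::vector<int>& rowptr,
                  const std::vector<int>& colidx,
                  const std::vector<double>& val) {
    for (int i = 0; i < m; ++i) {
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            int j = colidx[k];
            bool found = false;
            for (int l = rowptr[j]; l < rowptr[j + 1]; ++l) {
                if (colidx[l] == i && val[l] == val[k]) { found = true; break; }
            }
            if (!found) return false;  // (i, j) has no mirror entry (j, i)
        }
    }
    return true;
}
```

A web graph like webbase-1M is directed and fails this check, which is why reordering needs the symmetrized (A + A^T pattern) version that forceSymmetric would build.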

jspark1105 commented 4 years ago

I added two more commits. You should now see an error message when you try to run reordering_test with *.bin files.

yubai0827 commented 4 years ago

Yes, with a single OMP thread the performance (gflops) is much lower and more stable, though not perfectly so. Many thanks.

yubai0827 commented 4 years ago

Can you please confirm my understanding: if a .bin matrix (asymmetric, like webbase-1M.bin) is used with reordering_test, forceSymmetric is ignored, but the matrix-vector multiplication is still done correctly, just perhaps not at the ideal performance level? I mean before you made the last two commits. Appreciate your help.

jspark1105 commented 4 years ago

> Can you please confirm my understanding: if a .bin matrix (asymmetric, like webbase-1M.bin) is used with reordering_test, forceSymmetric is ignored, but the matrix-vector multiplication is still done correctly, just perhaps not at the ideal performance level? I mean before you made the last two commits. Appreciate your help.

Yes, even with *.bin inputs, the matrix-vector multiplication before reordering is still done correctly.

yubai0827 commented 4 years ago

This is great, thanks again.

yubai0827 commented 4 years ago

I am curious: if OMP_NUM_THREADS is not explicitly defined, what is its value? Does it depend on the workload? Does it change from run to run? Thanks.

jspark1105 commented 4 years ago

> I am curious: if OMP_NUM_THREADS is not explicitly defined, what is its value? Does it depend on the workload? Does it change from run to run? Thanks.

By default it's typically the number of logical cores available to the system, unless you limit the process's access, for example via numactl. BTW, for general questions not related to SpMP, please ask in other forums or ask the Intel OpenMP team.
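The fallback behavior can be illustrated with a small helper. This is a hypothetical sketch, not an OpenMP API; real runtimes resolve the default internally, but the logic is roughly this:

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical sketch of how the thread count is resolved: an explicit
// OMP_NUM_THREADS wins; otherwise OpenMP runtimes typically fall back to the
// number of logical cores visible to the process (which numactl, taskset, or
// cgroups can reduce). It does not depend on the workload and should not
// change from run to run on the same machine/configuration.
int effective_num_threads() {
    const char* env = std::getenv("OMP_NUM_THREADS");
    if (env != nullptr)
        return std::stoi(env);  // explicit user setting
    unsigned hw = std::thread::hardware_concurrency();  // logical cores
    return hw > 0 ? static_cast<int>(hw) : 1;
}
```

On a hyper-threaded machine this default means two OpenMP threads per physical core, which is another reason the KMP_AFFINITY setup above (one thread per physical core) gives more consistent SpMV numbers.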

yubai0827 commented 4 years ago

I got it, thank you!