eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

Investigate trsm performance with tcmalloc #591

Closed msimberg closed 11 months ago

msimberg commented 2 years ago

Running the triangular solver miniapp with tcmalloc on 20k by 20k matrices with a block size of 128 shows worrying behaviour, with every run slower than the previous one:

> OMP_NUM_THREADS=1 srun -u -N1 -n2 -c18 --hint=nomultithread miniapp/miniapp_triangular_solver --m 20480 --n 20480 --mb 128 --nb 128 --grid-rows 1 --grid-cols 2 --nruns 20 --nwarmups 0 --pika:use-process-mask
[0]
[0] 13.6897s 627.474GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[1]
[1] 18.0077s 477.014GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[2]
[2] 29.0798s 295.392GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[3]
[3] 35.506s 241.929GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[4]
[4] 40.8174s 210.448GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[5]
[5] 42.9886s 199.819GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[6]
[6] 44.1788s 194.436GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[7]
[7] 45.8649s 187.288GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[8]
[8] 47.3743s 181.32GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[9]
[9] 48.3985s 177.484GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[10]
[10] 49.5336s 173.416GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[11]
[11] 50.8611s 168.89GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[12]
[12] 51.5254s 166.713GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[13]
[13] 52.0118s 165.154GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[14]
[14] 52.4876s 163.657GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[15]
[15] 52.8686s 162.477GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[16]
[16] 53.2382s 161.349GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[17]
[17] 53.3924s 160.883GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[18]
[18] 53.6527s 160.102GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[19]
[19] 53.6799s 160.021GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
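
To put a number on the degradation, the output above can be parsed with a short script. This is a minimal sketch, not part of DLA-Future: the helper names `parse_runs` and `summarize` are hypothetical, the regular expression is based only on the result lines shown above, and the flop count `m * m * n` is an assumption that happens to reproduce the reported GFlop/s figures. Only the first and last result lines are pasted in as sample input.

```python
import re

# Matches result lines such as
# "[3] 35.506s 241.929GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC".
# Lines that contain only the run index, e.g. "[3]", do not match.
RESULT = re.compile(r"^\[(\d+)\]\s+([\d.]+)s\s+([\d.]+)GFlop/s")


def parse_runs(text):
    """Return a list of (run_index, seconds, gflops) tuples."""
    runs = []
    for line in text.splitlines():
        match = RESULT.match(line.strip())
        if match:
            runs.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return runs


def summarize(runs, m, n):
    """Compare the first and last run and cross-check the reported GFlop/s."""
    first, last = runs[0], runs[-1]
    flops = m * m * n  # assumed flop count for the solve, see the note above
    return {
        # How many times slower the last run is than the first one.
        "slowdown": last[1] / first[1],
        # Largest relative gap between reported and recomputed GFlop/s.
        "max_rel_error": max(abs(flops / t / 1e9 - g) / g for _, t, g in runs),
    }


# First and last result lines copied from the output above.
sample = """\
[0]
[0] 13.6897s 627.474GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
[19]
[19] 53.6799s 160.021GFlop/s dLLNN (20480, 20480) (128, 128) (1, 2) 18 MC
"""

runs = parse_runs(sample)
summary = summarize(runs, m=20480, n=20480)
print(f"runs parsed: {len(runs)}")
print(f"slowdown of last run vs first run: {summary['slowdown']:.2f}x")
print(f"max relative GFlop/s mismatch: {summary['max_rel_error']:.1e}")
```

With these two lines the last run comes out a little under four times slower than the first. Feeding the full output to `parse_runs` would give the whole curve, which could help when comparing against a run with another allocator.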

The first iteration is on par with e.g. mimalloc.

This may be a tcmalloc bug or a "deliberate tradeoff" that tcmalloc makes, or it may indicate something sketchy in DLA-Future. It would be good to do at least some investigation to rule out the latter.

Related to #587.

msimberg commented 11 months ago

It would still be interesting to know what's going on here, but we've since changed the default allocator to mimalloc and are unlikely to ever get back to this. Closing.