Re-optimize matrix multiplication

Threaded matrix multiplication should use the thread pool. The same goes for acb_poly/powsum_series_naive_threaded and acb_dft/rad2_threaded.
Re-tune the cutoffs (both single-threaded and multi-threaded) in light of recent improvements (or regressions?) in FLINT's matrix multiplication.
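
To make the re-tuning step concrete, here is a minimal sketch of how a crossover point between two multiplication strategies could be located. It is not Arb's tuning code. The callback type `mul_timer_fn`, the function `find_cutoff`, and the two synthetic cost models are hypothetical names introduced for illustration. In a real run the callbacks would time actual multiplications at a fixed precision and thread count.

```c
#include <assert.h>

/* Hypothetical callback: returns the cost (e.g. seconds) of one n x n
   multiplication for a fixed precision and thread count. */
typedef double (*mul_timer_fn)(long n);

/* Scan n = n_min, n_min + step, ... up to n_max and return the first size
   from which `candidate` is at least as fast as `baseline` for `confirm`
   consecutive sizes. Requiring several consecutive wins guards against
   picking a cutoff from a single noisy measurement.
   Returns -1 if no such size is found in the range. */
long find_cutoff(mul_timer_fn baseline, mul_timer_fn candidate,
                 long n_min, long n_max, long step, int confirm)
{
    long start = -1;
    int run = 0;

    assert(step > 0 && confirm > 0);

    for (long n = n_min; n <= n_max; n += step) {
        if (candidate(n) <= baseline(n)) {
            if (run == 0)
                start = n;          /* first size of the current winning run */
            if (++run >= confirm)
                return start;
        } else {
            run = 0;                /* winning run broken, start over */
            start = -1;
        }
    }
    return -1;
}

/* Synthetic cost models (assumptions for illustration only):
   a plain cubic algorithm, and one with a fixed overhead but a
   smaller constant factor. */
double model_classical(long n)
{
    return (double) n * n * n;
}

double model_block(long n)
{
    return 1000.0 + 0.5 * (double) n * n * n;
}
```

With these models, `find_cutoff(model_classical, model_block, 1, 100, 1, 3)` returns the first size at which the fixed overhead is amortized. For actual tuning, one callback might wrap `arb_mat_mul_classical` and the other `arb_mat_mul_block` or `arb_mat_mul_threaded`, with the scan repeated for each thread count set via `flint_set_num_threads`. That mapping is a suggestion, not something the issue specifies.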

Commit f7b6577b7a196ed133a0e5dbf89345c9df58f61d works around a performance problem where erf(x) at 1 million digits is slower with 8 threads than with 1 thread. Once matrix multiplication is correctly tuned, the workaround should be changed back to arb_mat_mul.

flintlib/arb #428