Make u128 range use widening multiply

I should just have done this with the previous PR.

The function that calculates a widening multiply for u64 using 32-bit multiplies is now a macro, so it can also be used for u128 with 64-bit multiplies. It is not yet optimal on 32-bit architectures, but it is much better than what we had.
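For readers who have not seen the technique, here is a minimal sketch of a widening multiply built from half-width multiplies and generated by a macro for both u64 and u128. The names `wmul_halves`, `wmul_u64` and `wmul_u128` are made up for this illustration and the code is not copied from this PR; it only shows the general schoolbook approach.

```rust
// Hypothetical sketch, not the code from this PR: generate a function that
// returns the full double-width product of two integers as (high, low),
// using only multiplies whose operands fit in half the width of the type.
macro_rules! wmul_halves {
    ($name:ident, $ty:ty, $half_bits:expr) => {
        fn $name(a: $ty, b: $ty) -> ($ty, $ty) {
            // Mask selecting the lower half of the bits.
            const MASK: $ty = !0 >> $half_bits;

            let (a_hi, a_lo) = (a >> $half_bits, a & MASK);
            let (b_hi, b_lo) = (b >> $half_bits, b & MASK);

            // Four partial products; each operand is below 2^half_bits,
            // so none of these can overflow $ty.
            let ll = a_lo * b_lo;
            let lh = a_lo * b_hi;
            let hl = a_hi * b_lo;
            let hh = a_hi * b_hi;

            // Middle column: the sum of three half-width values, which
            // always fits in $ty. Its upper bits are the carry into `high`.
            let mid = (ll >> $half_bits) + (lh & MASK) + (hl & MASK);

            // The shift discards the carry bits of `mid`, keeping only
            // the part that belongs in the low word.
            let low = (ll & MASK) | (mid << $half_bits);
            let high = hh + (lh >> $half_bits) + (hl >> $half_bits) + (mid >> $half_bits);
            (high, low)
        }
    };
}

// u64 from 32-bit halves, and u128 from 64-bit halves.
wmul_halves!(wmul_u64, u64, 32);
wmul_halves!(wmul_u128, u128, 64);
```

For a uniformly distributed `x` and a range size `n`, the high word of `x * n` always lies in `[0, n)`, which is why a widening multiply can stand in for a modulo or division when sampling a range.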

Benchmarks before:

test distr_range_i128         ... bench:     141,265 ns/iter (+/- 3,125) = 113 MB/s (x86_64)
test distr_range_i128         ... bench:     399,455 ns/iter (+/- 6,462) = 40 MB/s (x86)

After:

test distr_range_i128         ... bench:       9,076 ns/iter (+/- 103) = 1762 MB/s (x86_64)
test distr_range_i128         ... bench:      55,194 ns/iter (+/- 472) = 289 MB/s (x86)

dhardy / rand

Make u128 range use widening multiply #79