mratsim opened this issue 6 years ago (status: Open)
Would you be so kind as to supply a very terse pseudocode benchmark? I can't find the code for the linked benchmarks ^^ I expect the challenge to look something like this:
```julia
A = rand(1000, 1000)
b = view(A, 1, 1000:-1:1)  # slice out 1 dim, e.g. reverse 2nd dim
@time A ./ b
```
Is this accurate?
It feels to me like TRIOT is solving an issue we don't have to begin with in Julia - but I'm not 100% sure, so this will be fun to explore :)
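For reference, here is a rough NumPy translation of the sketch above, to make the access pattern concrete (NumPy is just an assumed stand-in here, not the actual benchmark code; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 1000))

# Slice out one dim and reverse the other: a strided view, no copy.
b = A[0, ::-1]

# The reversed view is non-contiguous (negative stride).
assert not b.flags["C_CONTIGUOUS"]

out = A / b  # b broadcasts across A's rows
```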
I'm starting to refactor the backend of my tensor library. I have been playing this past weekend with strided-iteration schemes, and here are my benchmarks on non-contiguous tensors.
Warmup: 1.1933 s, result 224 (displayed to avoid compiler optimizing warmup away)
# 4 Contiguous input tensors
<...> removed: if tensors are contiguous, a simple for loop should be used instead.
# 4 NON-contiguous input tensors
All benchmarks are float64, 1000 samples each; every run displays `output[[0,0]] = 1.143903810108473` to make sure the computation is not optimized away.

| Benchmark | Total | Average | Stddev | Min | Max |
|---|---|---|---|---|---|
| Global reference iteration | 49.648 s | 49.644 ms | 2.316 ms | 47.169 ms | 78.987 ms |
| Per tensor reference iteration | 36.795 s | 36.790 ms | 1.175 ms | 34.855 ms | 49.315 ms |
| Global TRIOT iteration | 47.085 s | 47.080 ms | 1.337 ms | 45.313 ms | 68.331 ms |
| Fused per tensor reference iteration | 30.588 s | 30.583 ms | 1.169 ms | 28.384 ms | 41.547 ms |
`Global reference iteration` is the same scheme I submitted to the Julia challenge; `per tensor reference iteration` is the one I use in my tensor library. TRIOT (#7) expands to hardcoded nested for loops, with the nesting level selected at runtime:
```nim
case rank(x)
of 0:
  echo [x[[]]]
  x[[]] += y[[]]
of 1:
  for iter_dim0_267884 in 0 ..< x.shape[0]:
    x[[iter_dim0_267884]] += y[[iter_dim0_267884]]
of 2:
  for iter_dim1_268264 in 0 ..< x.shape[0]:
    for iter_dim0_268366 in 0 ..< x.shape[1]:
      x[[iter_dim1_268264, iter_dim0_268366]] += y[[iter_dim1_268264, iter_dim0_268366]]
of 3:
  for iter_dim2_268746 in 0 ..< x.shape[0]:
    for iter_dim1_268848 in 0 ..< x.shape[1]:
      for iter_dim0_268950 in 0 ..< x.shape[2]:
        x[[iter_dim2_268746, iter_dim1_268848, iter_dim0_268950]] += y[[iter_dim2_268746, iter_dim1_268848, iter_dim0_268950]]
```
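For intuition, here is the same rank-dispatch idea as a toy Python sketch (my own illustration, not the library's macro; "tensors" are modeled as dicts keyed by index tuples):

```python
# Toy sketch of the TRIOT idea: pick a hardcoded loop nest at runtime
# based on the rank of the inputs.
def triot_inplace_add(x, y, shape):
    rank = len(shape)
    if rank == 0:
        x[()] += y[()]
    elif rank == 1:
        for i in range(shape[0]):
            x[(i,)] += y[(i,)]
    elif rank == 2:
        for i in range(shape[0]):
            for j in range(shape[1]):
                x[(i, j)] += y[(i, j)]
    else:
        raise NotImplementedError("add one branch per supported rank")

# Usage: an in-place 2x2 "tensor" addition.
x = {(i, j): float(i + j) for i in range(2) for j in range(2)}
y = {(i, j): 1.0 for i in range(2) for j in range(2)}
triot_inplace_add(x, y, (2, 2))
```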
Basically, `global` means that I keep track of a cartesian-coordinates array and use `tensor[coordinates]` for indexing (similar to Julia), while `per tensor` means I work directly with the data pointer offset, incrementing it by the dimensions' strides. The `per tensor` iterations have a shortcut for contiguous input tensors (a simple for loop, for speed) that could be used for `global` iterations as well, so that part of the comparison should be ignored.
In conclusion, using cartesian indexes is much less efficient than using pointer offsets directly, even if the tensor is not contiguous and the stride increments are irregular.
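To make the difference between the two schemes concrete, here is a plain-Python sketch (my own illustration, not the library's Nim code): the "global" scheme recomputes a full offset from the cartesian coordinates at every element, while the "per tensor" scheme keeps one running offset and bumps it by the strides like an odometer. The strides below model a 3x4 view over a transposed buffer.

```python
from itertools import product

shape   = (3, 4)
strides = (1, 3)          # transposed layout: a 3x4 view over a 4x3 buffer
data    = list(range(12))

# Global scheme: full offset recomputed from the coordinates at every element.
def iterate_global(data, shape, strides):
    out = []
    for coord in product(*(range(d) for d in shape)):
        offset = sum(c * s for c, s in zip(coord, strides))
        out.append(data[offset])
    return out

# Per-tensor scheme: keep one running offset, bump it by the innermost
# stride, and "carry" into outer dimensions when a dimension wraps around.
def iterate_per_tensor(data, shape, strides):
    out = []
    coord = [0] * len(shape)
    offset = 0
    total = 1
    for d in shape:
        total *= d
    for _ in range(total):
        out.append(data[offset])
        for dim in reversed(range(len(shape))):
            coord[dim] += 1
            offset += strides[dim]
            if coord[dim] < shape[dim]:
                break
            offset -= shape[dim] * strides[dim]  # undo the wrap, carry on
            coord[dim] = 0
    return out

# Both traversals visit the same elements in the same logical order.
assert iterate_global(data, shape, strides) == iterate_per_tensor(data, shape, strides)
```

The per-tensor version does O(1) work per element instead of O(rank) multiplications, which is one reason it wins in the benchmarks above.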
I just transpose some of the tensors so that they are non-contiguous:
```nim
proc mainBench_global(a, b, c: Tensor, nb_samples: int) =
  bench("Global reference iteration"):
    materialize(output, a, b, c):
      a + b - sin c

proc mainBench_perTensor(a, b, c: Tensor, nb_samples: int) =
  bench("Per tensor reference iteration"):
    forEach o in output, x in a, y in b, z in c:
      o = x + y - sin z

proc mainBench_global_triot(a, b, c: Tensor, nb_samples: int) =
  bench("Global TRIOT iteration"):
    triotForEach o in output, x in a, y in b, z in c:
      o = x + y - sin z

proc mainBench_fusedperTensor(a, b, c: Tensor, nb_samples: int) =
  bench("Fused per tensor reference iteration"):
    fusedForEach o in output, x in a, y in b, z in c:
      o = x + y - sin z
```
```nim
when isMainModule:
  randomize(42) # For reproducibility
  warmup()
  block: # All contiguous
    let
      a = randomTensor([1000, 1000], 1.0)
      b = randomTensor([1000, 1000], 1.0)
      c = randomTensor([1000, 1000], 1.0)
    mainBench_global(a, b, c, 1000)
    mainBench_perTensor(a, b, c, 1000)
    mainBench_global_triot(a, b, c, 1000)
    mainBench_fusedperTensor(a, b, c, 1000)
  block: # Non C contiguous (but no Fortran contiguous fast-path)
    let
      a = randomTensor([100, 10000], 1.0)
      b = randomTensor([10000, 100], 1.0).transpose
      c = randomTensor([10000, 100], 1.0).transpose
    mainBench_global(a, b, c, 1000)
    mainBench_perTensor(a, b, c, 1000)
    mainBench_global_triot(a, b, c, 1000)
    mainBench_fusedperTensor(a, b, c, 1000)
```
The current challenge is great for stressing the extensibility of a language (lifting functions over variadic container inputs), but it does not really stress performance, since a naïve for loop over the buffers with a single index `i` is enough.
NdArrays/tensors are very often sliced, creating a non-contiguous view over the memory, which requires a stride-aware iteration scheme that is often quite costly.
I think it would be great to have another challenge with element-wise operations on sliced tensors.
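As a concrete illustration of why slicing forces stride-aware iteration, here is a small NumPy example (NumPy used purely as an assumed stand-in): a transpose returns a view over the same memory with swapped strides, not a compact copy.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-contiguous
t = a.T                                            # transposed view

assert a.flags["C_CONTIGUOUS"]
assert not t.flags["C_CONTIGUOUS"]      # same buffer, different layout
assert np.shares_memory(a, t)           # no copy was made
assert t.strides == a.strides[::-1]     # strides are simply swapped
```

Any element-wise kernel that receives `t` must therefore walk the buffer with these strides rather than a single flat index.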
See the benchmark from TRIOT (issue #7) above.