SimonDanisch / julia-challenge

Many language implementations for the Julia Challenge: https://nextjournal.com/sdanisch/the-julia-language-challenge

Extra challenge: Sliced/non-contiguous tensor broadcasting. #8

Open mratsim opened 6 years ago

mratsim commented 6 years ago

The current challenge is great for stressing a language's extensibility (lifting functions over variadic container inputs), but it does not really stress performance: a naïve for loop over the buffers with a single shared index "i" is enough.

NdArrays/tensors are very often sliced, creating a non-contiguous view over memory. That requires a stride-aware iteration scheme, which is often quite costly.

I think it would be great to have another challenge with element-wise operations on sliced tensors.
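To make the problem concrete, here is a minimal NumPy sketch (my illustration, not from the original discussion) of how slicing yields a non-contiguous view whose strides a naive single-index loop cannot follow:

```python
import numpy as np

# A C-contiguous 3x4 float64 buffer: row stride 32 bytes, column stride 8 bytes.
a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Taking every other column yields a non-contiguous view: same buffer,
# but the column stride doubles, so a flat "same index i" loop is wrong.
v = a[:, ::2]

print(v.flags['C_CONTIGUOUS'])  # False
print(v.strides)                # (32, 16) in bytes
```

An element-wise kernel on `v` must advance by 16 bytes per column instead of 8, which is exactly the stride-aware iteration the challenge would stress.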

See the following benchmark from TRIOT (issue #7):

(benchmark result images)

SimonDanisch commented 6 years ago

Would you be so kind as to supply a very terse pseudocode benchmark? I can't even find the code for the linked benchmarks ^^ I expect the challenge to look something like this:

A = rand(1000, 1000)
b = view(A, 1, 1000:-1:1) #  slice out 1 dim, e.g. reverse 2nd dim
@time A ./ b

Is this accurate?
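In NumPy terms (a rough equivalent I'm adding for readers outside Julia; the original sketch is Julia), that micro-benchmark would be:

```python
import numpy as np

A = np.random.rand(1000, 1000)

# First row, reversed: a view with a negative stride over the same buffer.
b = A[0, ::-1]

# Broadcast divide: the iteration must honor b's -8 byte stride.
out = A / b

print(b.strides)  # (-8,)
```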

It feels to me like TRIOT is solving an issue we don't have to begin with in Julia, but I'm not 100% sure, so this will be fun to explore :)

mratsim commented 6 years ago

I'm starting to refactor the backend of my tensor library. I have been playing this past weekend with strided iteration schemes, and here are my benchmarks on non-contiguous tensors.

Warmup: 1.1933 s, result 224 (displayed to avoid compiler optimizing warmup away)

# 4 Contiguous input tensors
<...> removed; if tensors are contiguous, a simple for loop should be used.

# 4 NON-contiguous input tensors

Global reference iteration - float64
Collected 1000 samples in 49.648 seconds
Average time: 49.644ms
Stddev  time: 2.316ms
Min     time: 47.169ms
Max     time: 78.987ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Per tensor reference iteration - float64
Collected 1000 samples in 36.795 seconds
Average time: 36.790ms
Stddev  time: 1.175ms
Min     time: 34.855ms
Max     time: 49.315ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Global TRIOT iteration - float64
Collected 1000 samples in 47.085 seconds
Average time: 47.080ms
Stddev  time: 1.337ms
Min     time: 45.313ms
Max     time: 68.331ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Fused per tensor reference iteration - float64
Collected 1000 samples in 30.588 seconds
Average time: 30.583ms
Stddev  time: 1.169ms
Min     time: 28.384ms
Max     time: 41.547ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Global reference iteration is the one I submitted to the Julia challenge; per-tensor reference iteration is the one I use in my tensor library. TRIOT (#7) is similar to hardcoded nested for loops, with the nesting level selected at runtime:

case rank(x)
of 0:
  echo [x[[]]]
  x[[]] += y[[]]
of 1:
  for iter_dim0_267884 in 0 ..< x.shape[0]:
    x[[iter_dim0_267884]] += y[[iter_dim0_267884]]
of 2:
  for iter_dim1_268264 in 0 ..< x.shape[0]:
    for iter_dim0_268366 in 0 ..< x.shape[1]:
      x[[iter_dim1_268264, iter_dim0_268366]] += y[[iter_dim1_268264, iter_dim0_268366]]
of 3:
  for iter_dim2_268746 in 0 ..< x.shape[0]:
    for iter_dim1_268848 in 0 ..< x.shape[1]:
      for iter_dim0_268950 in 0 ..< x.shape[2]:
        x[[iter_dim2_268746, iter_dim1_268848, iter_dim0_268950]] += y[[iter_dim2_268746, iter_dim1_268848, iter_dim0_268950]]

Basically, "global" means that I keep track of a cartesian coordinate array and index with tensor[coordinates] (similar to Julia); "per tensor" means I work directly with each tensor's data-pointer offset, which I increment by that tensor's strides.

The per-tensor iterations have a shortcut that falls back to a simple for loop when the input tensors are contiguous, for speed. That shortcut could be used for global iteration as well, so that part of the comparison should be ignored.

In conclusion, using cartesian indices is much less efficient than using pointer offsets directly, even when the tensor is non-contiguous and the stride increments are irregular.
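A pure-Python sketch of the two schemes (my paraphrase of the description above, not the actual Nim implementation): "global" recomputes a buffer offset from cartesian coordinates for every tensor at every element, while "per tensor" keeps one running offset per tensor and bumps it by that tensor's own strides:

```python
class Tensor:
    """Minimal strided-tensor model: flat buffer + shape/strides (in elements) + offset."""
    def __init__(self, buf, shape, strides, offset=0):
        self.buf, self.shape, self.strides, self.offset = buf, shape, strides, offset

    def index(self, coord):
        # "Global" style: cartesian coordinate -> buffer offset, recomputed each access.
        return self.offset + sum(c * s for c, s in zip(coord, self.strides))

def global_add(out, x, y):
    # Rank-2 only, for brevity: one multiply-add per dimension per tensor per element.
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out.buf[out.index((i, j))] = x.buf[x.index((i, j))] + y.buf[y.index((i, j))]

def per_tensor_add(out, x, y):
    # One running offset per tensor, incremented by that tensor's own strides:
    # the inner loop does only additions, no coordinate-to-offset computation.
    ro, rx, ry = out.offset, x.offset, y.offset
    for i in range(out.shape[0]):
        co, cx, cy = ro, rx, ry
        for j in range(out.shape[1]):
            out.buf[co] = x.buf[cx] + y.buf[cy]
            co += out.strides[1]; cx += x.strides[1]; cy += y.strides[1]
        ro += out.strides[0]; rx += x.strides[0]; ry += y.strides[0]
```

A transposed view is just the same buffer with shape and strides swapped; both schemes then walk the buffers in different orders but produce identical results.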

The actual benchmark:

I just use transpose on some tensors so that they are non-contiguous:

proc mainBench_global(a, b, c: Tensor, nb_samples: int) =
  bench("Global reference iteration"):
    materialize(output, a, b, c):
      a + b - sin c

proc mainBench_perTensor(a, b, c: Tensor, nb_samples: int) =
  bench("Per tensor reference iteration"):
    forEach o in output, x in a, y in b, z in c:
      o = x + y - sin z

proc mainBench_global_triot(a, b, c: Tensor, nb_samples: int) =
  bench("Global TRIOT iteration"):
    triotForEach o in output, x in a, y in b, z in c:
      o = x + y - sin z

proc mainBench_fusedperTensor(a, b, c: Tensor, nb_samples: int) =
  bench("Fused per tensor reference iteration"):
    fusedForEach o in output, x in a, y in b, z in c:
      o = x + y - sin z

when isMainModule:
  randomize(42) # For reproducibility
  warmup()
  block: # All contiguous
    let
      a = randomTensor([1000, 1000], 1.0)
      b = randomTensor([1000, 1000], 1.0)
      c = randomTensor([1000, 1000], 1.0)
    mainBench_global(a, b, c, 1000)
    mainBench_perTensor(a, b, c, 1000)
    mainBench_global_triot(a, b, c, 1000)
    mainBench_fusedperTensor(a, b, c, 1000)

  block: # Non C contiguous (but no Fortran contiguous fast-path)
    let
      a = randomTensor([100, 10000], 1.0)
      b = randomTensor([10000, 100], 1.0).transpose
      c = randomTensor([10000, 100], 1.0).transpose
    mainBench_global(a, b, c, 1000)
    mainBench_perTensor(a, b, c, 1000)
    mainBench_global_triot(a, b, c, 1000)
    mainBench_fusedperTensor(a, b, c, 1000)