Ah one thing that I missed:
Is the data always contiguous?
If not, do you have a maximum number of dimensions?
As shown by the research on TRIOT (Template Recursive Iterations over Tensors), having a maximum number of dimensions allows compile-time unrolling of the iteration.
More pragmatically, in Arraymancer I chose to support tensors up to rank 7; having a limit lets me keep tensor metadata on the stack instead of stressing the GC.
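For illustration, a minimal sketch (not Arraymancer's actual definition, names are made up) of what a rank cap buys: with a compile-time maximum rank, shape and strides fit in fixed-size arrays, so the metadata can live on the stack instead of in GC-allocated seqs.

const MaxRank = 7

type TensorMeta = object
  rank: int
  shape: array[MaxRank, int]      # only the first `rank` entries are meaningful
  strides: array[MaxRank, int]
  offset: int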
Thanks for your submission :) I'll need to take a better look at this, but let me quickly answer the simple questions:
Do we need to rebuild a tensor library from scratch?
No! Just the core functionality of the broadcast itself! So if that tensor library already offers something broadcast-like, you shouldn't use it for the core loop ;) You should also show that you can construct the lazy representation of the broadcast as a normal user type of the language you target, and that it is fully transparent to the implementor of materialize!
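As an illustration only (hypothetical names, not the challenge's reference code), such a lazy representation can be an ordinary user-defined type that merely records the scalar kernel and its arguments; nothing is computed until materialize walks it:

type
  Broadcasted[F, Args] = object
    fn: F        # scalar kernel, e.g. a proc (x, y: float64): float64
    args: Args   # tuple of inputs: tensors and/or scalars

proc lazyBroadcast[F, Args](fn: F, args: Args): Broadcasted[F, Args] =
  Broadcasted[F, Args](fn: fn, args: args)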
@SimonDanisch I feel like there must be a better way to phrase the "No Libraries" rule, since it keeps confusing people. Possibly just remove it.
The point was to stop anyone from just having the whole implementation of broadcast in a library, right? So as to prevent anyone from submitting: from GiantUglyLibrary import broadcast
I feel like maybe just removing that rule entirely might lead to less confusion. And if anyone makes a PR that does that, then you can just comment on the PR.
@mratsim I am excited that you're submitting. It sounds like you have just the expertise to show how this is done.
Julia should specialize on integer multiplication. (Well, I mean in the compilation sense at least, since Julia specializes on everything.) It may be that we are missing a good dispatch opportunity there. Do you mind if someone pings you in the JuliaLang repo?
I feel like maybe just removing that rule entirely might lead to less confusion.
I did remove it :)
@oxinabox you're welcome to ping me in JuliaLang
The secret sauce is basically recreating a generic BLAS that can be used for integers (or for floats if a BLAS library is missing) instead of doing naive triple for-loops.
Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.
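Not code from the challenge or from Arraymancer, just a hedged sketch of one ingredient such a generic integer GEMM needs beyond the naive triple for-loop: blocking the loops into cache-sized tiles (a real implementation additionally packs panels and uses SIMD micro-kernels and threads).

proc tiledMatmul[T](A, B: seq[T], M, K, N: int, tile = 64): seq[T] =
  ## C = A * B for row-major M x K and K x N matrices, accumulated tile by tile
  ## so the working set stays in cache.
  result = newSeq[T](M * N)                 # zero-initialized accumulator
  for ii in countup(0, M - 1, tile):
    for kk in countup(0, K - 1, tile):
      for jj in countup(0, N - 1, tile):
        for i in ii ..< min(ii + tile, M):
          for k in kk ..< min(kk + tile, K):
            let aik = A[i*K + k]
            for j in jj ..< min(jj + tile, N):
              result[i*N + j] += aik * B[k*N + j]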
Julia can use SIMD acceleration for custom structs, if it figures out how to do it ;) So of course not for all types, but I found it to work well for the cases I care about! So to truly impress, user-defined types should get SIMD acceleration in your implementation! This allows Julia developers to create packages like ForwardDiff, which defines a custom Dual number type that still profits from SIMD acceleration :)
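For a concrete picture (a hedged Nim analogue of ForwardDiff's Dual, nothing from the challenge itself): a small value type with plain arithmetic operators, so a broadcast loop over it is still a flat loop over floats that an auto-vectorizer can handle.

type Dual = object
  val, deriv: float64

proc `+`(a, b: Dual): Dual =
  Dual(val: a.val + b.val, deriv: a.deriv + b.deriv)

proc `*`(a, b: Dual): Dual =
  # product rule for the derivative part
  Dual(val: a.val * b.val, deriv: a.val * b.deriv + a.deriv * b.val)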
Let me know if you have any questions :)
Btw, the challenge was born when somebody told me that they could easily implement Julia's broadcast in some other language - and I realized that I needed a minimal, self-contained implementation to actually set a reference and be able to judge whether that statement was true!
So if you can boil down the broadcast implementation you already have in a couple of hours and end up with a self-contained, performant toy version that fulfills the requirements, then you're pretty close to winning the challenge :)
A completely different challenge for the Julia implementation would be to get it as fast as your current Nim implementation - disregarding implementation complexity and time to solution ;)
Would you mind providing a minimal Nim broadcast example with included benchmarking code that I can run, maybe with a link to some basic install + setup steps? :)
I've added a Nim solution in #3 which allows the following syntax:
when isMainModule:
  import math

  let x = randomTensor([1, 2, 3], 10)
  let y = randomTensor([5, 2], 10)

  echo x # (shape: [1, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1, 10, 5, 5, 7, 3]))
  echo y # (shape: [5, 2], strides: [2, 1], offset: 0, storage: (data: @[8, 3, 7, 9, 3, 8, 5, 3, 7, 1]))

  block: # Simple assignation
    let a = broadcast(x, y):
      x * y
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[8, 80, 40, 15, 21, 9, 7, 70, 35, 45, 63, 27, 3, 30, 15, 40, 56, 24, 5, 50, 25, 15, 21, 9, 7, 70, 35, 5, 7, 3]))

  block: # Complex multi statement with type conversion
    let a = broadcast(x, y):
      let c = cos x.float64
      let s = sin y.float64
      sqrt(c.pow(2) + s.pow(2))
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1.12727828058919, 1.297255090978019, 1.029220081237957, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.5003057878335346, 0.859191628789455, 1.072346394223034, 0.5584276483137685, 0.8508559734652587, 0.3168265963213802, 1.029220081237957, 1.243864280886628, 1.399612404734566, 1.100664502137075, 1.274196529364651, 1.0, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.8879964266455946, 1.129797339073468, 1.299291561428286]))

  block: # Variadic number of types with proc declaration inside
    var u, v, w, x, y, z = randomTensor([3, 3], 10)
    let c = 2
    let a = broadcast(u, v, w, x, y, z):
      # ((u * v * w) div c) mod (if not zero (x - y + z) else 42)
      proc ifNotZero(val, default: int): int =
        if val == 0: default
        else: val
      let uvw_divc = u * v * w div c
      let xmypz = x - y + z
      uvw_divc mod ifNotZero(xmypz, 42)
    echo a # (shape: [3, 3], strides: [3, 1], offset: 0, storage: (data: @[0, 0, 0, 7, 4, 0, 0, 2, 0]))
Instead of reusing Arraymancer, I just reused what is in the Nim standard library.
The main differences from Arraymancer's map/broadcast/iteration:
Arraymancer unfortunately doesn't support variadic broadcast yet, only up to 3 inputs. It's planned, but the iteration scheme is different, and a Nim bug with extracting container subtypes within macros prevents me from combining variadic broadcast with my iteration scheme and clean code.
Iteration scheme:
Arraymancer uses the following strategy per tensor and then gets the corresponding coordinates or value (or both). This avoids recomputing getIndex for every element, at the cost of an inner if/else branch and one loop per tensor. It also allows a plain for loop over contiguous tensors.
template strided_iteration[T](t: Tensor[T], strider: IterKind): untyped =
  ## Iterate over a Tensor, displaying data as in C order, whatever the strides.

  ## Iterator init
  var coord = newSeq[int](t.rank)   # Coordinates in the n-dimensional space
  var backstrides: seq[int] = @[]   # Offset between end of dimension and beginning
  for i, j in zip(t.strides, t.shape):
    backstrides.add(i*(j-1))

  var iter_pos = t.offset

  ## Iterator loop
  for i in 0 ..< t.shape.product:
    ## Templating the return value
    when strider == IterKind.Values: yield t.data[iter_pos]
    elif strider == IterKind.Coord_Values: yield (coord, t.data[iter_pos])
    elif strider == IterKind.MemOffset: yield iter_pos
    elif strider == IterKind.MemOffset_Values: yield (iter_pos, t.data[iter_pos])

    ## Computing the next position
    for k in countdown(t.rank - 1, 0):
      if coord[k] < t.shape[k] - 1:
        coord[k] += 1
        iter_pos += t.strides[k]
        break
      else:
        coord[k] = 0
        iter_pos -= backstrides[k]
The broadcast implementation I just pulled iterates over all possible coordinate values and then feeds them to getIndex, like the Julia code (I think). That scheme is much simpler, is easier to use with a variable number of inputs, and produces much less code; however, moving from one element to the next is more costly because of the full getIndex computation versus a simple increment.
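For reference, a hedged sketch of the getIndex computation described above (consistent with the shape/strides/offset metadata in the echo output; the actual helper in #3 may differ):

proc getIndex(coord, strides: openArray[int], offset: int): int =
  ## Memory position of an element from its n-dimensional coordinates;
  ## in this scheme it is recomputed from scratch for every element.
  result = offset
  for k in 0 ..< coord.len:
    result += coord[k] * strides[k]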
If you're studying iteration optimization I suggest you look into TRIOT and the resources I've gathered.
Awesome!! Thanks a lot :) Excited to try this out! Could you help me a bit to get started by supplying just the equivalent of the first benchmarking challenge?
using BenchmarkTools
a = rand(1000, 1000); # 1000x1000 Float64 matrix
b = rand(1000); # 1000 Float64 vector
c = 1.0 # Float64 scalar
out = similar(a); # uninitialized array with the same shape and element type as a
br = @broadcast a + b - sin(c) # construct the lazy representation of our call tree
@btime materialize!($out, $br) # materialize the broadcast && benchmark it with some statistical rigour
I think this can be closed
Hello Simon,
I'm the author of Arraymancer, a tensor library with NumPy + PyTorch-like syntax and a deep learning focus, written from scratch in the Nim programming language.
How would your challenge work for languages without a built-in N-dimensional array library? Do we need to rebuild a tensor library from scratch? Or can we map over the basic sequence/vector type?
On my from-scratch tensor lib
It can fuse A * X + B into a single GEMM call. Basically the only thing missing is support for a variadic number of arguments (I only hardcoded 2 and 3).
Alternatively, on basic seq types
I'm also the author of the loop-fusion library, which provides elementwise loops over a variadic number of arguments.
Loopfusion is a very short lib (300 lines, and that could be brought down a lot).
I could easily rebuild a shorter one with OpenMP parallel support, just for map (it supports reduce and filter as well), in a couple dozen lines.
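Not the loopfusion API, just a hedged sketch of that idea: an elementwise map over two seqs whose loop is annotated for OpenMP via Nim's `||` iterator (compile with --passC:-fopenmp --passL:-fopenmp; without OpenMP it degrades to an ordinary serial loop).

proc map2[T, U, V](a: seq[T], b: seq[U], f: proc (x: T, y: U): V): seq[V] =
  assert a.len == b.len
  result = newSeq[V](a.len)
  for i in 0 || (a.len - 1):        # OpenMP "parallel for" when enabled
    result[i] = f(a[i], b[i])

# Usage: echo map2(@[1, 2, 3], @[4, 5, 6], proc (x, y: int): int = x * y)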
Types
Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.
Furthermore, Julia doesn't seem to optimize for integer types; for example, here are benchmarks I did on integer matrix multiplication:
Archlinux, E3-1230v5 (Skylake quad-core, 3.4 GHz, turbo 3.8 GHz)
macOS, i5-5257U (Broadwell dual-core mobile, 2.7 GHz, turbo 3.1 GHz)
As you can see, Julia is about 10x slower than Nim on my quad-core machine (and the cost grows as n^3 with matrix size).