Ah one thing that I missed:
Is the data always contiguous?
If not, do you have a maximum number of dimensions?
As shown by the research on TRIOT (Template Recursive Iterations over Tensors), having a maximum number of dimensions allows compile-time unrolling of the iteration.
More pragmatically, in Arraymancer I chose to support tensors up to rank 7; having a limit lets me keep tensor metadata on the stack instead of stressing the GC.
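For illustration, a minimal sketch (not Arraymancer's actual definition, names are made up) of what a rank cap buys: with a compile-time maximum rank, shape and strides fit in fixed-size arrays, so the metadata can live on the stack instead of in GC-allocated seqs.

const MaxRank = 7

type TensorMeta = object
  rank: int
  shape: array[MaxRank, int]      # only the first `rank` entries are meaningful
  strides: array[MaxRank, int]
  offset: int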
Thanks for your submission :) I'll need to take a better look at this, but let me quickly answer the simple questions:
Do we need to rebuild a tensor library from scratch?
No! Just the core functionality of the broadcast itself! So if that tensor library already offers something broadcast-like, you shouldn't use it for the core loop ;) You should also show that you can construct the lazy representation of the broadcast as a normal user type of the language you target, and that it is fully transparent to the implementor of materialize!
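As an illustration only (hypothetical names, not the challenge's reference code), such a lazy representation can be an ordinary user-defined type that merely records the scalar kernel and its arguments; nothing is computed until materialize walks it:

type
  Broadcasted[F, Args] = object
    fn: F        # scalar kernel, e.g. a proc (x, y: float64): float64
    args: Args   # tuple of inputs: tensors and/or scalars

proc lazyBroadcast[F, Args](fn: F, args: Args): Broadcasted[F, Args] =
  Broadcasted[F, Args](fn: fn, args: args)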
@SimonDanisch I feel like there must be a better way to phrase the "No Libraries" rule, since it keeps confusing people. Possibly just remove it.
The point was to stop anyone from just having the whole implementation of broadcast in a library, right? So as to prevent anyone from submitting: from GiantUglyLibrary import broadcast
I feel like maybe just removing that rule entirely might lead to less confusion. And if anyone makes a PR that does that, then you can just comment on the PR.
@mratsim I am excited that you're submitting. It sounds like you have just the expertise to show how this is done.
Julia should specialize on integer multiplication. (Well, I mean in the compilation sense at least, since Julia specializes on everything.) It may be that we are missing a good dispatch opportunity there. Do you mind if someone pings you in the JuliaLang repo?
I feel like maybe just removing that rule entirely might lead to less confusion.
I did remove it :)
@oxinabox you're welcome to ping me in JuliaLang
The secret sauce is basically recreating a generic BLAS that can be used for integers (or for floats if a BLAS library is missing) instead of doing naive triple for-loops.
Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.
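Not code from the challenge or from Arraymancer, just a hedged sketch of one ingredient such a generic integer GEMM needs beyond the naive triple for-loop: blocking the loops into cache-sized tiles (a real implementation additionally packs panels and uses SIMD micro-kernels and threads).

proc tiledMatmul[T](A, B: seq[T], M, K, N: int, tile = 64): seq[T] =
  ## C = A * B for row-major M x K and K x N matrices, accumulated tile by tile
  ## so the working set stays in cache.
  result = newSeq[T](M * N)                 # zero-initialized accumulator
  for ii in countup(0, M - 1, tile):
    for kk in countup(0, K - 1, tile):
      for jj in countup(0, N - 1, tile):
        for i in ii ..< min(ii + tile, M):
          for k in kk ..< min(kk + tile, K):
            let aik = A[i*K + k]
            for j in jj ..< min(jj + tile, N):
              result[i*N + j] += aik * B[k*N + j]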
Julia can use SIMD acceleration for custom structs, if it figures out how to do it ;) So of course not for all types, but I found it to work well for the cases I care about! So to truly impress, user-defined types should get SIMD acceleration in your implementation! This allows Julia developers to create packages like ForwardDiff, which defines a custom Dual number type that still profits from SIMD acceleration :)
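For a concrete picture (a hedged Nim analogue of ForwardDiff's Dual, nothing from the challenge itself): a small value type with plain arithmetic operators, so a broadcast loop over it is still a flat loop over floats that an auto-vectorizer can handle.

type Dual = object
  val, deriv: float64

proc `+`(a, b: Dual): Dual =
  Dual(val: a.val + b.val, deriv: a.deriv + b.deriv)

proc `*`(a, b: Dual): Dual =
  # product rule for the derivative part
  Dual(val: a.val * b.val, deriv: a.val * b.deriv + a.deriv * b.val)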
Let me know if you have any questions :)
Btw, the challenge was born when somebody told me that they could easily implement Julia's broadcast in some other language - and I realized that I needed a minimal, self-contained implementation to actually set a reference and be able to judge whether that statement was true!
So if you can boil down the broadcast implementation you already have in a couple of hours and end up with a self-contained, performant toy version that fulfills the requirements, then you're pretty close to winning the challenge :)
A completely different challenge for the Julia implementation would be to get it as fast as your current Nim implementation - disregarding implementation complexity and time to solution ;)
Would you mind providing a minimal Nim broadcast example with included benchmarking code that I can run, maybe with a link to some basic install + setup steps? :)
I've added a Nim solution in #3 which allows the following syntax:
when isMainModule:
  import math

  let x = randomTensor([1, 2, 3], 10)
  let y = randomTensor([5, 2], 10)

  echo x # (shape: [1, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1, 10, 5, 5, 7, 3]))
  echo y # (shape: [5, 2], strides: [2, 1], offset: 0, storage: (data: @[8, 3, 7, 9, 3, 8, 5, 3, 7, 1]))

  block: # Simple assignation
    let a = broadcast(x, y):
      x * y
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[8, 80, 40, 15, 21, 9, 7, 70, 35, 45, 63, 27, 3, 30, 15, 40, 56, 24, 5, 50, 25, 15, 21, 9, 7, 70, 35, 5, 7, 3]))

  block: # Complex multi statement with type conversion
    let a = broadcast(x, y):
      let c = cos x.float64
      let s = sin y.float64
      sqrt(c.pow(2) + s.pow(2))
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1.12727828058919, 1.297255090978019, 1.029220081237957, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.5003057878335346, 0.859191628789455, 1.072346394223034, 0.5584276483137685, 0.8508559734652587, 0.3168265963213802, 1.029220081237957, 1.243864280886628, 1.399612404734566, 1.100664502137075, 1.274196529364651, 1.0, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.8879964266455946, 1.129797339073468, 1.299291561428286]))

  block: # Variadic number of types with proc declaration inside
    var u, v, w, x, y, z = randomTensor([3, 3], 10)
    let c = 2
    let a = broadcast(u, v, w, x, y, z):
      # ((u * v * w) div c) mod (if not zero (x - y + z) else 42)
      proc ifNotZero(val, default: int): int =
        if val == 0: default
        else: val
      let uvw_divc = u * v * w div c
      let xmypz = x - y + z
      uvw_divc mod ifNotZero(xmypz, 42)
    echo a # (shape: [3, 3], strides: [3, 1], offset: 0, storage: (data: @[0, 0, 0, 7, 4, 0, 0, 2, 0]))
Instead of reusing Arraymancer, I just reused what is in the Nim standard library.
The main differences from Arraymancer's map/broadcast/iteration:
Arraymancer unfortunately doesn't support variadic broadcast yet, only up to 3 inputs. It's planned, but the iteration scheme is different, and a Nim bug with extracting container subtypes within macros prevents me from combining variadic broadcast with my iteration scheme and clean code.
Iteration scheme:
Arraymancer uses the following strategy per tensor and then gets the corresponding coordinates or value (or both). This avoids recomputing getIndex for every element, at the cost of an inner if/else branch and one loop per tensor. It also allows a plain for loop over contiguous tensors.
template strided_iteration[T](t: Tensor[T], strider: IterKind): untyped =
  ## Iterate over a Tensor, displaying data as in C order, whatever the strides.

  ## Iterator init
  var coord = newSeq[int](t.rank)   # Coordinates in the n-dimensional space
  var backstrides: seq[int] = @[]   # Offset between end of dimension and beginning
  for i, j in zip(t.strides, t.shape):
    backstrides.add(i*(j-1))

  var iter_pos = t.offset

  ## Iterator loop
  for i in 0 ..< t.shape.product:
    ## Templating the return value
    when strider == IterKind.Values: yield t.data[iter_pos]
    elif strider == IterKind.Coord_Values: yield (coord, t.data[iter_pos])
    elif strider == IterKind.MemOffset: yield iter_pos
    elif strider == IterKind.MemOffset_Values: yield (iter_pos, t.data[iter_pos])

    ## Computing the next position
    for k in countdown(t.rank - 1, 0):
      if coord[k] < t.shape[k] - 1:
        coord[k] += 1
        iter_pos += t.strides[k]
        break
      else:
        coord[k] = 0
        iter_pos -= backstrides[k]
The broadcast implementation I just pulled iterates over all possible coordinate values and then feeds them to getIndex, like the Julia code (I think). That scheme is much simpler, is easier to use with a variable number of inputs, and produces much less code; however, moving from one element to the next is more costly because of the full getIndex computation versus a simple increment.
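For reference, a hedged sketch of the getIndex computation described above (consistent with the shape/strides/offset metadata in the echo output; the actual helper in #3 may differ):

proc getIndex(coord, strides: openArray[int], offset: int): int =
  ## Memory position of an element from its n-dimensional coordinates;
  ## in this scheme it is recomputed from scratch for every element.
  result = offset
  for k in 0 ..< coord.len:
    result += coord[k] * strides[k]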
If you're studying iteration optimization I suggest you look into TRIOT and the resources I've gathered.
Awesome!! Thanks a lot :) Excited to try this out! Could you help me a bit to get started by supplying just the equivalent of the first benchmarking challenge?
using BenchmarkTools
a = rand(1000, 1000); # 1000x1000 Float64 matrix
b = rand(1000); # 1000 Float64 vector
c = 1.0 # Float64 scalar
out = similar(a); # uninitialized array with the same shape and element type as a
br = @broadcast a + b - sin(c) # construct the lazy representation of our call tree
@btime materialize!($out, $br) # materialize the broadcast && benchmark it with some statistical rigour
I think this can be closed
Hello Simon,
I'm the author of Arraymancer, a tensor library with NumPy + PyTorch-like syntax and a deep learning focus, written from scratch in the Nim programming language.
How would your challenge work for languages without a built-in N-dimensional array library? Do we need to rebuild a tensor library from scratch? Or can we map over the basic sequence/vector type?
On my from-scratch tensor lib
It can fuse A * X + B into a single GEMM call. Basically the only thing missing is support for a variadic number of arguments (I only hardcoded 2 and 3).
Alternatively, on basic seq types
I'm also the author of the loop-fusion library, which provides elementwise loops over a variadic number of arguments.
Loopfusion is a very short lib (300 lines, and that could be brought down a lot).
I could easily rebuild a shorter one with OpenMP parallel support, just for map (it supports reduce and filter as well), in a couple dozen lines.
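Not the loopfusion API, just a hedged sketch of that idea: an elementwise map over two seqs whose loop is annotated for OpenMP via Nim's `||` iterator (compile with --passC:-fopenmp --passL:-fopenmp; without OpenMP it degrades to an ordinary serial loop).

proc map2[T, U, V](a: seq[T], b: seq[U], f: proc (x: T, y: U): V): seq[V] =
  assert a.len == b.len
  result = newSeq[V](a.len)
  for i in 0 || (a.len - 1):        # OpenMP "parallel for" when enabled
    result[i] = f(a[i], b[i])

# Usage: echo map2(@[1, 2, 3], @[4, 5, 6], proc (x, y: int): int = x * y)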
Types
Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.
Furthermore, Julia doesn't seem to optimize for integer types; for example, here are benchmarks I did on integer matrix multiplication:
Archlinux, E3-1230v5 (Skylake quad-core, 3.4 GHz, turbo 3.8 GHz)
macOS, i5-5257U (Broadwell dual-core mobile, 2.7 GHz, turbo 3.1 GHz)
As you can see, Julia is about 10x slower than Nim on my quad-core machine (and the cost grows as n^3 with matrix size).