Pangoraw / Coil.jl

꩜ Lift Julia array operations to MLIR dialects and run using IREE.
MIT License

Batch size is not dynamic #3

Open Sixzero opened 8 months ago

Sixzero commented 8 months ago

I think the whole package idea is a must-have for the Julia ecosystem, even though it still feels experimental. Some of us in the Julia community would like to use this for Llama2.jl GPU support, and we are curious what the speed gains would be.

Testing the example scripts:

using Coil, Flux

dense = Dense(3, 6, relu)
compiled_dense = Coil.compile(dense)

x = randn(Float32,3,1);

res = dense(x)
res = compiled_dense(x)   # works

x = randn(Float32, 3, 128)

res = dense(x)
res = compiled_dense(x)   # crashes

The only change was to increase the batch size from 1 to 128, and it crashed:

ERROR: iree/runtime/src/iree/modules/hal/utils/buffer_diagnostics.c:225: INVALID_ARGUMENT; input1 shape dimension 0 mismatch; expected 1 but have 3; expected shape `1x3`, actual shape `3x3`; while invoking native function hal.buffer_view.assert; while calling import; 
[ 1]   native hal.buffer_view.assert:0 -
[ 0] bytecode module.Dense:366 /home/six/.julia/packages/Flux/ljuc2/src/layers/basic.jl:170:0
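
A possible stopgap until shapes can be dynamic (a hedged sketch: run_compiled and COMPILED are made-up names, and it assumes a fresh Coil.compile wrapper specializes on the first input it sees, consistent with the behavior above) is to keep one compiled wrapper per input size:

using Coil

const COMPILED = Dict{Dims,Any}()   # one compiled wrapper per input size

function run_compiled(f, x)
    g = get!(COMPILED, size(x)) do
        Coil.compile(f)   # fresh wrapper; specializes on this size at first call
    end
    return g(x)
end

run_compiled(dense, randn(Float32, 3, 1))    # compiles once for batch 1
run_compiled(dense, randn(Float32, 3, 128))  # compiles again for batch 128

This trades compile time and memory for one compiled artifact per distinct shape, which may be acceptable when only a few batch sizes occur.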

What I also find interesting is that if we trace it manually, it works:

import Coil.Tracing
import Coil.IREE
_, tape = Tracing.trace(dense, x; ctx=Tracing.Context(dense));
compiled_tape = Tracing.compile_tape(tape, x; verbose=true, device=IREE.Device("local-task"), hal_target="llvm-cpu", allow_scalar_args=false)

@time res = dense(x)
@show sum(res)
res = compiled_tape(x)  # just a warmup run
@time res = compiled_tape(x)
@show sum(res)

I don't know if this helps.

The improvement in allocations can already be seen, so some things are working really well! The output of the second script:

  0.000020 seconds (2 allocations: 6.250 KiB)
sum(res) = 241.41396f0
  0.000237 seconds (46 allocations: 1.867 KiB)
sum(res) = 241.41406f0

So we can already see that operations got fused, which would be even more meaningful on GPU in my opinion. What I find interesting is the allocation count, which seems a little high. Also, I don't know how I could define a "cuda" device so I could test GPU speed.
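
For the GPU question: I have not tried this, but upstream IREE ships a "cuda" HAL driver and a matching "cuda" compiler target, so by analogy with the "local-task"/"llvm-cpu" pair above one might attempt the following. Both strings are assumptions carried over from IREE itself, not confirmed for Coil:

import Coil.Tracing
import Coil.IREE

# Untested sketch: swap the CPU driver/target pair for IREE's CUDA ones.
_, tape = Tracing.trace(dense, x; ctx=Tracing.Context(dense));
compiled_tape = Tracing.compile_tape(tape, x; verbose=true,
                                     device=IREE.Device("cuda"),  # assumed driver name
                                     hal_target="cuda",           # assumed target name
                                     allow_scalar_args=false)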

Pangoraw commented 7 months ago

The batch dimension is currently not treated differently from other tensor dimensions, so the tensor size used to trace and compile the function is the only one that will work. We could imagine a pass that transforms this first dimension into a dynamic shape.
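
Until such a pass exists (in MLIR terms, relaxing a static tensor<128x3xf32> argument to tensor<?x3xf32>), a user-side workaround is to always hand the compiled function the batch size it was traced at, zero-padding smaller batches and slicing the result. A minimal sketch with a made-up helper name, assuming the features-by-batch layout of the Flux example above:

# Hypothetical helper: run a function compiled for a fixed batch size B on a
# smaller batch by zero-padding the batch (second) dimension, then slicing
# the output back down to the real batch size.
function run_padded(compiled, x::AbstractMatrix{T}, B::Integer) where {T}
    n = size(x, 2)
    n == B && return compiled(x)
    n < B || error("batch $n exceeds compiled batch size $B")
    xp = zeros(T, size(x, 1), B)
    xp[:, 1:n] .= x
    return compiled(xp)[:, 1:n]
end

run_padded(compiled_dense, randn(Float32, 3, 100), 128)  # assumes compiled_dense was traced at batch 128

This is only valid when the network treats batch columns independently, which holds for a Dense layer but not for anything that mixes information across the batch.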