Sixzero opened 8 months ago
The batch dimension is currently not treated differently from other tensor dimensions. Therefore, the compiled function only works for the exact tensor size it was traced with. We could imagine a pass that transforms this first dimension into a dynamic shape.
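To illustrate the idea (this is a hypothetical sketch in Python, not the package's actual implementation): a compile cache keyed by the full static shape produces a new compilation for every batch size, whereas a cache that wildcards the first dimension can reuse a single batch-polymorphic artifact.

```python
compiled_static = {}   # key: full static shape, e.g. (128, 512)
compiled_dynamic = {}  # key: shape with batch dim wildcarded, e.g. (None, 512)

def compile_for(shape):
    """Stand-in for tracing and compiling a function at a given shape."""
    return f"kernel<{shape}>"

def run_static(shape):
    # Every new batch size misses the cache and forces a fresh trace/compile
    # (or, if retracing is not done, simply fails at the new size).
    if shape not in compiled_static:
        compiled_static[shape] = compile_for(shape)
    return compiled_static[shape]

def run_dynamic(shape):
    # Treat dimension 0 as dynamic: all batch sizes map to one cache key.
    key = (None,) + shape[1:]
    if key not in compiled_dynamic:
        compiled_dynamic[key] = compile_for(key)
    return compiled_dynamic[key]

for bs in (1, 32, 128):
    run_static((bs, 512))
    run_dynamic((bs, 512))

print(len(compiled_static))   # one compilation per batch size
print(len(compiled_dynamic))  # a single batch-polymorphic compilation
```

Here three batch sizes yield three static compilations but only one dynamic-batch compilation, which is the saving such a pass would provide.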
I think the whole package idea is a must-have for the Julia ecosystem, even though it still feels experimental. Some of us in the Julia community would like to use this for Llama2.jl GPU support, and I would be curious what the speed gains would be.
Testing the example scripts:
The only change was increasing the batch size from 1 to 128, and things crashed:
What I also find interesting is that if we trace it, then it works:
I don't know if this helps.
The improvement in allocations can already be seen, so some things are working really well! The output of the second script:
So we can already see that operations got fused, which would be even more meaningful on GPU in my opinion. What I find interesting is the allocation count, which seems a bit high. Also, I don't know how I could define a "cuda" device so that I could test GPU speed.