Closed: AndrewMead10 closed this issue 1 year ago
Hi @AndrewMead10,
Thank you for reporting this bug. I was able to reproduce it and identified two issues: one concerning our use of CUDA graphs and another in our Triton kernels. We'll investigate further in the near future.
For now, you can just remove the CUDA graphs utility. I faced a similar issue with GPT-2 from Transformers and got around it by changing
return cuda_graphs_wrapper(gm, example_inputs)
to
return gm
here. I didn't get much speedup after doing this, though.
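For context, here is a sketch of the spot being patched: kernl's TorchDynamo compiler callback. The module paths and surrounding code are my reconstruction and may differ between versions, so treat this as an illustration of the workaround rather than the exact source:

```python
import torch
from kernl.optimizer.dynamo_backend import dynamo_backend_ofi  # assumed module path
from kernl.optimizer.cuda_graph import cuda_graphs_wrapper  # assumed module path

def compiler(gm: torch.fx.GraphModule, example_inputs):
    # Replace eligible ops in the FX graph with kernl's Triton kernels.
    dynamo_backend_ofi(gm)
    # Original behavior: capture the optimized module in a CUDA graph.
    # return cuda_graphs_wrapper(gm, example_inputs)
    # Workaround: skip CUDA graph capture and run the module eagerly.
    return gm
```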
Yes, the optimization is very long (more than 1 hour) with no benefit afterwards. However, I think we'll come up with other optimizations in the future to speed up Llama models.
I'm running into the same issue, with the same result, for GPT-2. Is there any update on this problem?
Description
When trying to use kernl with Llama 7B, I get an error when capturing the graph.
Steps to reproduce
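Roughly, a script like the following; this is a minimal sketch assuming kernl's optimize_model entry point, and the checkpoint name is a placeholder for any Llama 7B weights in Hugging Face format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kernl.model_optimization import optimize_model

# Placeholder: any Llama 7B checkpoint in Hugging Face format.
model_name = "path/to/llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval().cuda()

# Swap the model's forward for kernl's TorchDynamo-compiled version.
optimize_model(model)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    # The error is raised on the first forward pass, while the
    # CUDA graph is being captured.
    model.generate(**inputs, max_new_tokens=10)
```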
Expected Behavior
An optimized Llama model.
Actual Behavior
Your environment