Currently, llama.cpp allows us to pass `-i -ins` for an interactive chat session using the Alpaca template, and it also supports GPU offloading via CUDA or OpenCL. Having the same here would massively improve inference times. Will it be supported anytime soon? The only thing stopping StarCoder from taking off is the huge barrier to entry posed by slow inference. I am very impressed with the model based on testing via the web interface (StarChat).