Open claysauruswrecks opened 1 year ago
Can you elaborate?
Sure. Pruning involves removing nodes and connections from the network while minimizing accuracy loss. There is also an inference performance gain, both in speed and in hardware requirements.
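To make the idea concrete, here is a minimal sketch of the simplest variant, unstructured magnitude pruning: zero out the smallest-magnitude weights until a target sparsity is reached. This is a hypothetical illustration of the general technique, not the specific algorithm any particular framework uses.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a 2-D weight
    matrix (list of lists) so that roughly `sparsity` fraction
    of the entries become zero."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of entries to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # largest magnitude that still gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]

weights = [[0.9, -0.05, 0.3],
           [0.02, -1.2, 0.07]]
pruned = magnitude_prune(weights, 0.5)  # zeroes the 3 smallest-magnitude weights
```

Real pruning pipelines typically prune gradually during fine-tuning and retrain to recover accuracy, rather than applying a one-shot threshold like this.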
Here is one such framework for pruning models, which produced the benchmark results mentioned above: https://github.com/neuralmagic/deepsparse
Someone is bound to prune the LLaMA derivatives, so I opened this task so others can track progress or add their own results.
I don't know of any pruned LLaMA models right now; this is just a placeholder for people to fill in if they are aware of such options.
Here is an example of a performance increase from this pruning process: https://github.com/mlcommons/inference_results_v3.0/tree/main/open/NeuralMagic