
Discussion: Investigate Perf Boosts Through Pruning (DeepSparse) #931

Closed: MillionthOdin16 closed this issue 7 months ago

MillionthOdin16 commented 1 year ago

Just saw this and it seems pretty crazy. I don't know exactly where to put it, but figured it's worth discussing. They claim significant performance gains and impressive model compression. A lot of the interesting information is right on the README page I linked.

Neural Magic Repo Link

Our MLPerf Inference v3.0 submission contains the following results for the BERT-Large SQuAD v1.1 question answering task:

| Benchmark | Engine | Precision | Compressed File Size | SQuAD v1.1 F1 Score (R = % of base accuracy) | Offline Throughput [samples/sec] |
| --- | --- | --- | --- | --- | --- |
| BERT-Large Baseline | ONNXRuntime | FP32 | 1.3 GB | 90.874 (R=100.00%) | 4.60 |
| oBERT-Large 99% | DeepSparse | INT8 | 38.2 MB | 90.03 (R=99.07%) | 1367.14 |
| oBERT-MobileBERT 99.9% | DeepSparse | INT8 | 19.45 MB | 90.80 (R=99.92%) | 3275.62 |
| oBERT-MobileBERT 99% | DeepSparse | INT8 | 9.56 MB | 90.41 (R=99.49%) | 5578.73 |

https://github.com/mlcommons/inference_results_v3.0/blob/main/open/NeuralMagic/README.md

jon-chuang commented 1 year ago

From the linked repo:

> unstructured gradual pruning, quantization-aware training, and structural distillation

I think the resulting model layout would be very different and, further, not directly comparable to LLaMA. But definitely interesting.
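For context, here is a minimal sketch of the first ingredient in that pipeline, unstructured magnitude pruning, using PyTorch's built-in pruning utilities. This is only an illustration of the general idea, not Neural Magic's actual recipe (which adds gradual schedules, quantization-aware training, and distillation):

```python
# Illustration only: unstructured magnitude pruning of one weight matrix.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)  # stand-in for a single transformer projection

# Zero out the 90% of weights with the smallest absolute value (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2%}")  # ~90.00% of entries are now zero
```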

slaren commented 1 year ago

This may be interesting: https://github.com/horseee/LLaMA-Pruning

> Pruning: The following script globally removes 50% of the dimensions of the LLaMA-7B model, resulting in a lightweight model with 1.72B parameters.
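A rough sanity check on that 1.72B figure (my own back-of-envelope estimate, not from the linked repo): most of LLaMA-7B's parameters sit in dense d-by-d projections, so halving every dimension roughly quarters the parameter count.

```python
# Back-of-envelope estimate. Assumption: parameters are dominated by dense
# d x d projections, which shrink quadratically when dimensions are halved;
# embeddings and norms shrink only linearly, so the real count lands a bit higher.
base_params = 6.74e9   # approximate LLaMA-7B parameter count
keep_ratio = 0.5       # the script keeps 50% of each dimension
approx = base_params * keep_ratio ** 2
print(f"~{approx / 1e9:.2f}B parameters")  # ~1.69B, close to the reported 1.72B
```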

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.