IST-DASLab / sparsegpt

Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
https://arxiv.org/abs/2301.00774
Apache License 2.0

What causes GPU memory increase compared to dense mode? Is this normal? #1

Closed chenrui17 closed 1 year ago

chenrui17 commented 1 year ago

Great job! I reproduced your code, but I noticed an increase in GPU memory and I don't understand why, since according to the paper the model parameters have been reduced by 50%.

Godofnothing commented 1 year ago

@chenrui17 the parameters were set to zero, but the model still has the same memory footprint, since the weights are stored as dense tensors.
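
A quick way to check this yourself (a minimal sketch, assuming an already-pruned PyTorch model is loaded as `model`; the variable name is just a placeholder, not part of this repo's API):

```python
# Count zeroed weights and measure the dense storage footprint.
# `model` is assumed to be an already-pruned PyTorch model (placeholder).
total, zeros, bytes_used = 0, 0, 0
for name, p in model.named_parameters():
    total += p.numel()
    zeros += (p == 0).sum().item()
    bytes_used += p.numel() * p.element_size()  # dense storage, zeros included

print(f"sparsity: {zeros / total:.2%}")                    # ~50% after pruning
print(f"dense footprint: {bytes_used / 1024**3:.2f} GiB")  # unchanged by pruning
```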

henrywoo commented 1 year ago

I found the model is running even slower. Is that expected? If size doesn't change and speed is slower, what is the pruning for? Did I miss anything? cc @Godofnothing

efrantar commented 1 year ago

As of right now, this is a research-focused repository with the goal of accurately sparsifying GPT-style models. As @Godofnothing is saying, sparse models are currently stored as dense tensors with many weights that are exactly zero. This simulates a sparse model and is standard in sparsity research. There are various other projects focused on actual size reduction and speedups for existing sparse models, e.g. DeepSparse, XNNPACK or CUTLASS (for 2:4 sparsity).
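
To make this concrete, here is a small sketch (not part of this repository) of why simply compressing the zeros is not enough: at 50% unstructured sparsity, a generic format like CSR can actually be larger than the dense tensor because of index overhead, which is exactly why dedicated runtimes and 2:4 kernels exist.

```python
import torch

W = torch.randn(4096, 4096, dtype=torch.float16)
W[torch.rand_like(W) < 0.5] = 0.0   # simulate ~50% unstructured sparsity

W_csr = W.to_sparse_csr()           # generic compressed-sparse-row storage

dense_bytes = W.numel() * W.element_size()
csr_bytes = sum(t.numel() * t.element_size()
                for t in (W_csr.values(), W_csr.col_indices(), W_csr.crow_indices()))
print(f"dense: {dense_bytes / 2**20:.1f} MiB, CSR: {csr_bytes / 2**20:.1f} MiB")
# At 50% sparsity the int64 indices outweigh the savings, so real size/speed
# gains need dedicated formats and kernels (DeepSparse, XNNPACK, 2:4 via CUTLASS).
```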

The memory consumption and runtime of the final model should be exactly the same; perhaps some of the memory increase and slowdown occurs during the sparsification process itself and/or our layer-by-layer evaluation procedure, which is designed to evaluate large models on a single GPU?
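
One way to locate where the peak occurs is something like the following rough sketch (`run_sparsegpt_pruning`, `evaluate_perplexity`, and `model` are placeholders standing in for this repo's actual pruning and evaluation code, not real functions here):

```python
import torch

def peak_gib(fn, *args):
    """Run fn and report the peak CUDA memory it allocated, in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    return torch.cuda.max_memory_allocated() / 1024**3

print("pruning peak (GiB):  ", peak_gib(run_sparsegpt_pruning, model))  # placeholder fn
print("inference peak (GiB):", peak_gib(evaluate_perplexity, model))    # placeholder fn
# If the first number is higher, the extra memory comes from the sparsification /
# calibration pass, not from the final pruned model itself.
```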

slush0 commented 1 year ago

Is there any how-to for reducing the size of the sparsified model? I tried with DeepSparse but failed miserably. It seems there's no way to convert a DeepSparse-compiled model back to the Hugging Face format.