chenrui17 closed this issue 1 year ago
@chenrui17 The parameters were set to zero, but the model in fact has the same memory footprint, since the weights are stored as dense tensors.
I found that the model actually runs even slower. Is that expected? If the size doesn't change and the speed is slower, what is the pruning for? Did I miss anything? cc @Godofnothing
As of right now, this is a research-focused repository with the goal of accurately sparsifying GPT-style models. As @Godofnothing is saying, sparse models are currently stored as dense tensors with many weights that are exactly zero. This simulates a sparse model and is standard in sparsity research. There are various other projects focused on actual size reduction and speedups for existing sparse models, e.g. DeepSparse, XNNPACK or CUTLASS (for 2:4 sparsity).
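For example, one quick way to see this is to count the exact zeros in a pruned checkpoint and compare against the dense storage cost. A minimal sketch, assuming `transformers` is installed; `facebook/opt-125m` is just an example model and `opt-125m-sparse.pt` is a hypothetical pruned state dict:

```python
# Sketch: verify that a "sparse" model is stored densely with many exact zeros.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
# model.load_state_dict(torch.load("opt-125m-sparse.pt"))  # hypothetical pruned weights

total, zeros = 0, 0
for name, p in model.named_parameters():
    if p.dim() == 2:  # linear-layer weight matrices
        total += p.numel()
        zeros += (p == 0).sum().item()

print(f"fraction of exact zeros in 2D weights: {zeros / total:.2%}")

# Dense storage cost is numel * element_size per tensor, regardless of sparsity.
dense_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"dense footprint: {dense_bytes / 1e6:.1f} MB (unchanged by pruning)")
```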
The memory consumption and runtime of the final model should be exactly the same. Perhaps the memory increases and slowdowns you observe occur during the sparsification process itself, and/or during our layer-by-layer evaluation procedure, which is designed to evaluate large models on a single GPU?
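One way to check where the extra memory comes from is to record the CUDA peak allocation separately for the pruning step and for a plain forward pass. A rough sketch, assuming a single CUDA GPU; `run_sparsegpt_pruning`, `calib_loader`, and `input_ids` are hypothetical placeholders for whatever pruning entry point and evaluation batch you actually use:

```python
# Sketch: compare peak GPU memory of the pruning step vs. plain inference.
import torch

def peak_gpu_gb(fn, *args, **kwargs):
    torch.cuda.reset_peak_memory_stats()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 1e9  # GB

# _, prune_peak = peak_gpu_gb(run_sparsegpt_pruning, model, calib_loader)  # hypothetical call
# _, infer_peak = peak_gpu_gb(model, input_ids)
# print(f"pruning peak: {prune_peak:.2f} GB, inference peak: {infer_peak:.2f} GB")
```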
Is there any how-to for reducing the size of the sparsified model? I tried DeepSparse, but failed miserably. It seems there is no way to convert a DeepSparse-compiled model back to the Hugging Face format.
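If the goal is only to shrink the saved checkpoint (not to speed up inference), one illustrative option is to round-trip the pruned state dict through a sparse tensor layout and densify it again before loading into a regular Hugging Face model. This is a sketch, not a supported workflow, and note the caveat in the comments: at ~50% unstructured sparsity the index overhead of CSR can outweigh the savings, so real size reductions generally come from the specialized formats in the frameworks mentioned above.

```python
# Sketch: store 2D weights as sparse CSR on disk, densify before loading.
# At ~50% sparsity the int64 CSR indices can make the file *larger*, so
# measure the resulting file size before relying on this.
import torch

def compress_state_dict(state_dict):
    return {k: v.to_sparse_csr() if v.dim() == 2 else v for k, v in state_dict.items()}

def decompress_state_dict(sparse_sd):
    return {k: v.to_dense() if v.layout == torch.sparse_csr else v
            for k, v in sparse_sd.items()}

# torch.save(compress_state_dict(model.state_dict()), "opt-sparse-csr.pt")
# model.load_state_dict(decompress_state_dict(torch.load("opt-sparse-csr.pt")))
```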
Great job! I reproduced your code, but I noticed an increase in GPU memory and I don't understand why, because according to the paper the model parameters have been reduced by 50%.