I'm just skimming through the paper quickly, and I'm no AI expert whatsoever, but I do think I see one problem: the datasets they use to determine redundancy focus on English only. LLaMA supports a lot more languages than just English, including programming languages, which would be pruned completely with their datasets.
I don't think this is necessarily a show-stopping problem; it just means that in order to use this technique for LLaMA we'd need more datasets specifically suited to LLaMA. If you were to use their original dataset, LLaMA would become an English-only LLM.
3.1 To analyze the general redundancy in pre-trained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks – Sequence Labeling and Sequence Classification tasks. For the sequence labeling tasks, we study core linguistic tasks, i) part-of-speech (POS) tagging using the Penn TreeBank, ii) CCG super tagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017) and iv) syntactic chunking using CoNLL 2000 shared task dataset (Sang and Buchholz, 2000).
The Penn Treebank development set
Building a large annotated corpus of English: the Penn Treebank
See also SparseGPT/LLaMa https://github.com/lachlansneff/sparsellama, https://arxiv.org/abs/2301.00774
I didn't dig into it yet, just my 10 cents: LLaMA uses SwiGLU, BERT uses GELU, and others use ReLU. SwiGLU seems like a super-heavy activation to me, which also increases the perceived neuronal density of the network.
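To make that point concrete, here is a minimal sketch (PyTorch) of the two feed-forward styles; the dimension names are placeholders and this is not LLaMA's exact implementation, just an illustration of why the SwiGLU block carries more weights:

```python
# Rough sketch of why a SwiGLU feed-forward block is "heavier" than a plain
# ReLU/GELU one: it needs three weight matrices instead of two, plus an
# elementwise gate. Dimensions are illustrative, not LLaMA's actual config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReluFFN(nn.Module):           # classic transformer FFN (2 matrices)
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):         # LLaMA-style FFN (3 matrices + gating)
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU-gated projection: silu(gate(x)) * up(x), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```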
Next is the already mentioned factor: if you use a limited training set, you basically lobotomize all areas you didn't train at all, you damage the areas you didn't train enough, and you remove a lot of the nuances the model has learned, making it more "pragmatic" and less "creative". I agree that this type of optimization is interesting, but it comes with non-trivial consequences and complications.
@Azeirah
the datasets they use to determine redundancy focus on English only, LLaMA supports a lot more languages than just English,
this might be true, but it would still be beneficial for specific use cases (e.g. English only).
the LLaMA paper states:
... performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model.
languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
Just to mention: other Latin-script languages are very similar to English (same Latin word roots), and the Cyrillic-script languages allow the model to be used as a simple translation engine (and make it more interesting for other countries and more people in general).
I believe you can't just remove them without any damage to the model.
I wonder if pruning can be kind of like a form of finetuning where the resulting model is much smaller. For example, what if one pruned using the instruct data people are finetuning on? Or using the output distributions of a larger model as in knowledge distillation? In the latter case the model could possibly increase in strength rather than decrease.
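To make the distillation idea concrete, here is a minimal sketch of the kind of loss I have in mind (PyTorch; `teacher`, `student`, `batch` and the temperature are placeholders, not anything from this repo):

```python
# Minimal knowledge-distillation loss sketch: a small "student" model is trained
# to match the output distribution of a larger "teacher" model.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and minimize the KL divergence between them.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Usage sketch (teacher = big model, student = pruned/smaller model):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
# loss.backward()
```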
Today I found there is structured LLaMA pruning code at https://github.com/horseee/LLaMA-Pruning and https://github.com/VainF/Torch-Pruning .
Note that high-quality pruning and quantization finetune the model during the optimization to reduce the impact on performance. The above approach does not appear to do that.
Has anyone pruned alpaca/vicuna and uploaded it somewhere?
There is also the knowledge distillation technique. I would like to see someone distill the 65B model into a 7B model.
what’s the sparsification news? (or was this issue closed inaccurately?)
1) @ggerganov have you considered keeping neuron activation statistics?
This could be used to prune (lobotomize) the model and remove unused "knowledge" to reduce the model size and required RAM, and to improve inference performance.
The neuron usage statistics could be collected for a given set of use cases (e.g. leave the model running in production for some months and then stick to only the knowledge that was actually used). A rough sketch of how such statistics could be collected is at the end of this comment.
2) Another interesting approach, which doesn't require lobotomizing the model, would be to lazy-load the model weights dynamically, by partitions:
When additional knowledge is required (when some sleeping neurons get activated), the model should load and connect those weight partitions.
Unused weight partitions could be removed from memory by disconnecting those unused neurons, similar to how the OS manages its cache.
I see lazy loading has already been implemented in https://github.com/ggerganov/llama.cpp/pull/613/.
However, I believe this implementation is still loading and processing weights that may not contribute to the final inference result.
Hence, we should distinguish between "used weights" and "relevant weights."
The challenge here would be to dynamically detect which weight partitions will be relevant for the inference process.
One relatively simple initial approach would be:
A more complex approach would involve identifying multiple mappings of relevant weight partitions and dynamically detecting which weights will be required by the subsequent layers. In other words, the model weights would be grouped by "knowledge topics" that are loaded and used only when required.
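Regarding point 1 above, here is a rough sketch of what collecting neuron activation statistics could look like (PyTorch forward hooks; `model`, the module filter and the firing threshold are assumptions, not llama.cpp internals):

```python
# Rough sketch: count how often each linear-layer output unit "fires" above a
# threshold over some representative traffic. The counts could later drive
# pruning, or decide which weight partitions are worth keeping in memory.
import torch

activation_counts = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Count, per output unit, how many tokens exceeded the threshold.
        # (A fixed >0 threshold is a crude proxy for "activated".)
        fired = (output.detach() > 0.0).sum(dim=tuple(range(output.dim() - 1)))
        activation_counts[name] = activation_counts.get(name, 0) + fired
    return hook

def attach_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):   # or restrict to FFN layers
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Run representative prompts through `model`, then inspect activation_counts
# to find neurons that (almost) never fire for your use case.
```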
Turns out that most LLM parameters are redundant, see https://aclanthology.org/2020.emnlp-main.398.pdf. They run the experiments with BERT and XLNet, and code for the pruning is provided. There's apparently lots of room for improvement, since LLaMA is very similar to those models. If someone's interested, that could be a nice thing to try 😄
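For anyone who wants to poke at this, here's a rough illustration of the correlation-based redundancy idea (this is not the paper's exact pipeline; `activations` is assumed to be a `(num_tokens, num_neurons)` matrix you have already extracted from the model):

```python
# Rough illustration of correlation-based neuron redundancy analysis: collect
# neuron activations over a corpus, cluster highly correlated neurons, and keep
# one representative per cluster. Threshold and linkage method are arbitrary
# choices for the sketch, not values from the paper.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def redundant_neuron_clusters(activations, threshold=0.3):
    # Pairwise Pearson correlation between neurons (columns).
    corr = np.corrcoef(activations, rowvar=False)
    # Turn correlation into a distance: highly correlated -> close together.
    dist = 1.0 - np.abs(corr)
    # Condensed upper-triangle distances for hierarchical clustering.
    condensed = dist[np.triu_indices_from(dist, k=1)]
    clusters = fcluster(linkage(condensed, method="average"),
                        t=threshold, criterion="distance")
    return clusters  # neurons sharing a cluster id are pruning/merging candidates

# Example: keep only the first neuron of each cluster, drop the rest.
# keep = [np.flatnonzero(clusters == c)[0] for c in np.unique(clusters)]
```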