I'm just skimming through the paper quickly, and I'm no AI expert whatsoever, but I do think I see one problem: the datasets they use to determine redundancy focus on English only. LLaMA supports a lot more languages than just English, including programming languages, which would be pruned completely with their datasets.
I don't think this is necessarily a show-stopping problem; it just means that in order to use this technique for LLaMA we'd need more datasets specifically suited to LLaMA. If you were to use their original dataset, LLaMA would become an English-only LLM.
3.1 To analyze the general redundancy in pre-trained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks – Sequence Labeling and Sequence Classification tasks. For the sequence labeling tasks, we study core linguistic tasks, i) part-of-speech (POS) tagging using the Penn TreeBank, ii) CCG super tagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017) and iv) syntactic chunking using CoNLL 2000 shared task dataset (Sang and Buchholz, 2000).
The Penn Treebank development set
Building a large annotated corpus of English: the Penn Treebank
See also SparseGPT/LLaMa https://github.com/lachlansneff/sparsellama, https://arxiv.org/abs/2301.00774
I didn't dig into it yet, just my 10 cents: LLaMA uses SwiGLU, BERT uses GELU, and others use ReLU. SwiGLU seems like a super-heavy activation to me, which also increases the perceived neuronal density of the network.
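To make that point concrete, here is a minimal sketch (PyTorch) of the two feed-forward styles; the dimension names are placeholders and this is not LLaMA's exact implementation, just an illustration of why the SwiGLU block carries more weights:

```python
# Rough sketch of why a SwiGLU feed-forward block is "heavier" than a plain
# ReLU/GELU one: it needs three weight matrices instead of two, plus an
# elementwise gate. Dimensions are illustrative, not LLaMA's actual config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReluFFN(nn.Module):           # classic transformer FFN (2 matrices)
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):         # LLaMA-style FFN (3 matrices + gating)
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU-gated projection: silu(gate(x)) * up(x), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```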
Next is the already mentioned factor: if you use a limited training set, you basically lobotomize all areas you didn't train at all, you damage the areas you didn't train enough, and you remove a lot of the nuances the model has learned, making it more "pragmatic" and less "creative". I agree that this type of optimization is interesting, but it comes with non-trivial consequences and complications.
@Azeirah
the datasets they use to determine redundancy focus on English only, LLaMA supports a lot more languages than just English,
this might be true, but it would still be beneficial for specific use cases (e.g. English only).
the LLaMA paper states:
... performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model.
languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.
Just to mention: other Latin-script languages are very similar to English (same Latin word roots), and the Cyrillic-script languages allow the model to be used as a simple translation engine (and make it more interesting for other countries and more people in general).
I believe you can't just remove them without any damage to the model.
I wonder if pruning can be kind of like a form of finetuning where the resulting model is much smaller. For example, what if one pruned using the instruct data people are finetuning on? Or using the output distributions of a larger model as in knowledge distillation? In the latter case the model could possibly increase in strength rather than decrease.
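To make the distillation idea concrete, here is a minimal sketch of the kind of loss I have in mind (PyTorch; `teacher`, `student`, `batch` and the temperature are placeholders, not anything from this repo):

```python
# Minimal knowledge-distillation loss sketch: a small "student" model is trained
# to match the output distribution of a larger "teacher" model.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and minimize the KL divergence between them.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Usage sketch (teacher = big model, student = pruned/smaller model):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
# loss.backward()
```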
Today I found there is structured LLaMA pruning code at https://github.com/horseee/LLaMA-Pruning and https://github.com/VainF/Torch-Pruning .
Note that high-quality pruning and quantization finetune the model during the optimization to reduce the impact on performance. The above approach does not appear to do that.
Has anyone pruned alpaca/vicuna and uploaded it somewhere?
There is also the knowledge distillation technique. I would like to see someone distill the 65B model into a 7B model.
what’s the sparsification news? (or was this issue closed inaccurately?)
1) @ggerganov have you considered keeping neuron activation statistics?
This could be used to prune (lobotomize) the model and remove unused "knowledge" to reduce the model size and required RAM, and to improve inference performance.
The neuron usage statistics could be collected for a given set of use cases (e.g. leave the model running in production for some months and then stick to only the knowledge that was actually used). A rough sketch of how such statistics could be collected is at the end of this comment.
2) Another interesting approach, which doesn't require lobotomizing the model, would be to lazy-load the model weights dynamically, by partitions:
When additional knowledge is required (when some sleeping neurons get activated), the model should load and connect those weight partitions.
Unused weight partitions could be removed from memory by disconnecting those unused neurons, similar to how the OS manages its cache.
I see lazy loading has already been implemented in https://github.com/ggerganov/llama.cpp/pull/613/.
However, I believe this implementation is still loading and processing weights that may not contribute to the final inference result.
Hence, we should distinguish between "used weights" and "relevant weights."
The challenge here would be to dynamically detect which weight partitions will be relevant for the inference process.
One relatively simple initial approach would be:
A more complex approach would involve identifying multiple mappings of relevant weight partitions and dynamically detecting which weights will be required by the subsequent layers. In other words, the model weights would be grouped by "knowledge topics" that are loaded and used only when required.
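Regarding point 1 above, here is a rough sketch of what collecting neuron activation statistics could look like (PyTorch forward hooks; `model`, the module filter and the firing threshold are assumptions, not llama.cpp internals):

```python
# Rough sketch: count how often each linear-layer output unit "fires" above a
# threshold over some representative traffic. The counts could later drive
# pruning, or decide which weight partitions are worth keeping in memory.
import torch

activation_counts = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Count, per output unit, how many tokens exceeded the threshold.
        # (A fixed >0 threshold is a crude proxy for "activated".)
        fired = (output.detach() > 0.0).sum(dim=tuple(range(output.dim() - 1)))
        activation_counts[name] = activation_counts.get(name, 0) + fired
    return hook

def attach_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):   # or restrict to FFN layers
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Run representative prompts through `model`, then inspect activation_counts
# to find neurons that (almost) never fire for your use case.
```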
Turns out that most LLM parameters are redundant, see https://aclanthology.org/2020.emnlp-main.398.pdf. They run the experiments with BERT and XLNet, and code for the pruning is provided. There's apparently lots of room for improvement, since LLaMA is very similar to those models. If someone's interested, that could be a nice thing to try 😄
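For anyone who wants to poke at this, here's a rough illustration of the correlation-based redundancy idea (this is not the paper's exact pipeline; `activations` is assumed to be a `(num_tokens, num_neurons)` matrix you have already extracted from the model):

```python
# Rough illustration of correlation-based neuron redundancy analysis: collect
# neuron activations over a corpus, cluster highly correlated neurons, and keep
# one representative per cluster. Threshold and linkage method are arbitrary
# choices for the sketch, not values from the paper.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def redundant_neuron_clusters(activations, threshold=0.3):
    # Pairwise Pearson correlation between neurons (columns).
    corr = np.corrcoef(activations, rowvar=False)
    # Turn correlation into a distance: highly correlated -> close together.
    dist = 1.0 - np.abs(corr)
    # Condensed upper-triangle distances for hierarchical clustering.
    condensed = dist[np.triu_indices_from(dist, k=1)]
    clusters = fcluster(linkage(condensed, method="average"),
                        t=threshold, criterion="distance")
    return clusters  # neurons sharing a cluster id are pruning/merging candidates

# Example: keep only the first neuron of each cluster, drop the rest.
# keep = [np.flatnonzero(clusters == c)[0] for c in np.unique(clusters)]
```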