EleutherAI / project-menu

See the issue board for the current status of active and prospective projects!

[Project] Catastrophic forgetting in Transformers #10

Closed · kcoost closed this issue 2 years ago

kcoost commented 3 years ago

@StellaAthena could you add the labels [help wanted], [Newbies welcome] and [recruiting ML dev] to this project?

StellaAthena commented 3 years ago

I have taken a transformer, fine-tuned it on a task, and evaluated it on a different task. Every time I’ve done that, performance drops.

Are there particular levels of performance loss you would like to see?

Jeevesh8 commented 3 years ago

Hypothesis: The performance on a downstream task depends only on performance at language modelling. The performance on any downstream task can be attributed entirely to the relation between the language-modelling objective and the fine-tuning objective, irrespective of model structure. Based on "A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks".

Proposition:

  1. If the hypothesis is true, it would be more logical to check how fast you can recover performance on language modelling, or how much performance on language modelling remains after fine-tuning, instead of evaluating every pair of tasks (a rough sketch of this measurement follows below).

  2. Also, the paper above can help you form intuitions about how two tasks must be related for the performance loss to be low.
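
A minimal sketch of what measuring "how much language-modelling performance remains after fine-tuning" could look like, assuming the Hugging Face `transformers` library; the fine-tuned checkpoint path and the held-out texts are hypothetical placeholders:

```python
# Hedged sketch: compare perplexity on held-out LM data before and after fine-tuning.
# "path/to/fine-tuned-checkpoint" and the held-out texts below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def lm_perplexity(model_name_or_path, texts, device="cpu"):
    tok = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path).to(device).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tok(text, return_tensors="pt", truncation=True).to(device)
            # Passing labels=input_ids makes the model return the mean token cross-entropy.
            losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

held_out = ["Some held-out text drawn from the pre-training distribution ..."]
print("perplexity before fine-tuning:", lm_perplexity("gpt2", held_out))
print("perplexity after fine-tuning: ", lm_perplexity("path/to/fine-tuned-checkpoint", held_out))
```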

StellaAthena commented 3 years ago

> Hypothesis: The performance on a downstream task depends only on performance at language modelling. Based on "A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks".
>
> Proposition:
>
> 1. If the hypothesis is true, it would be more logical to check how fast you can recover performance on language modelling, or how much performance on language modelling remains after fine-tuning, instead of evaluating every pair of tasks.
> 2. Also, the paper above can help you form intuitions about how two tasks must be related for the performance loss to be low.

I think that this (as worded) is false. Can you make a concrete prediction that I can try to falsify?

Jeevesh8 commented 3 years ago

Suppose there are two models which obtain the same loss on the language-modelling task. They may still perform differently on a downstream task, even though the relation between the downstream and pre-training tasks remains the same. The hypothesis ignores this phenomenon and says "a model that performs better on language modelling has a better chance of performing well on a downstream task".

An example

Suppose you want to do text sentiment classification. You send the text to be classified, concatenated with ". I feel [MASK]", as a prompt to the pre-trained model, look at the word predicted for the mask, and use that as the sentiment label. The better the model is at language modelling, the better it will be at sentiment classification, irrespective of what the underlying model is. On the other hand, had we taken some external data and fine-tuned two pre-trained models that initially give exactly the same predictions for every context, the performance on sentiment classification may still end up being different for the two models. This is because the performance depends both on the relation between the language-modelling and sentiment-classification tasks and on the model structure; the hypothesis assumes it depends only on the former.
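
For concreteness, here is a minimal sketch of that zero-shot setup, assuming the Hugging Face `transformers` fill-mask pipeline; the prompt template and the two label words ("great"/"terrible") are illustrative choices, not anything prescribed by the hypothesis:

```python
# Hedged sketch of prompt-based zero-shot sentiment classification with a masked LM.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def classify_sentiment(text):
    prompt = f"{text}. I feel [MASK]"
    # Restrict the mask prediction to two candidate label words and compare their scores.
    preds = unmasker(prompt, targets=["great", "terrible"])
    scores = {p["token_str"]: p["score"] for p in preds}
    return "positive" if scores.get("great", 0.0) > scores.get("terrible", 0.0) else "negative"

print(classify_sentiment("The movie was a complete waste of time"))  # expected: negative
```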

The hypothesis becomes less and less true as you fine-tune more and more. Still, how fast you can recover performance on language modelling should serve as a good way to measure how much fine-tuning on a particular task takes away performance on any other task.

StellaAthena commented 3 years ago

> Suppose there are two models which obtain the same loss on the language-modelling task. They may still perform differently on a downstream task, even though the relation between the downstream and pre-training tasks remains the same. You ignore this phenomenon and say "a model that performs better on language modelling has a better chance of performing well on a downstream task". An example: suppose you want to do text sentiment classification. You send the text to be classified, concatenated with ". I feel [MASK]", as a prompt to the pre-trained model, and see what word is predicted for the mask. The better the model is at language modelling, the better it will be at sentiment classification, irrespective of what the underlying objective is.
>
> The hypothesis becomes less and less true as you fine-tune more and more. But still, I think how fast you can recover performance on language modelling would serve as a good way to find out how much fine-tuning on a particular task takes away performance on all other tasks combined.

Can you describe an experiment I can run that would potentially disprove this hypothesis?

Jeevesh8 commented 3 years ago

Related Papers with Short Abstracts

  1. On the interplay between fine-tuning and Sentence-Level Probing for Linguistic Knowledge in Pre-trained Transformers

Abstract : Studies BERT, RoBERTa, and ALBERT, and investigates through sentence-level probing how fine-tuning affects their pre-trained representations. Finds that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning tasks, and probing tasks. The changes to the pre-trained representations are typically larger for higher layers; only in very few cases does fine-tuning have a positive effect on probing accuracy, leading to an accuracy larger than just using the pre-trained model with a strong pooling method.

  2. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties

Abstract : “Downstream” tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations, but the complexity of the tasks makes it difficult to infer what kind of information is present in the representations. Introduces 10 probing tasks designed to capture simple linguistic features of sentences at various levels: surface information (length of the sentence, which words are in it, etc.), syntactic information (depth of the parse tree, the nature (noun/verb, etc.) of the top constituents of the parse tree), and semantic information (the tense of the main verb, the number of the main subject, etc.).
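
For concreteness, a minimal sketch of what such a sentence-level probe looks like, assuming `transformers`, `torch`, and `scikit-learn`; the probe target (a toy short/long sentence-length label, in the spirit of the "surface information" tasks) and the two example sentences are placeholders:

```python
# Hedged sketch: a linear probe on frozen sentence embeddings; only the probe is trained.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(sentences):
    """Mean-pooled BERT embeddings; the encoder itself is never updated."""
    with torch.no_grad():
        enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**enc).last_hidden_state          # (batch, seq_len, dim)
        mask = enc["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

sentences = ["A short one.", "This sentence is quite a bit longer than the first one."]
labels = [0, 1]  # toy "surface information" target: 0 = short sentence, 1 = long sentence
probe = LogisticRegression(max_iter=1000).fit(embed(sentences), labels)
# Comparing probe accuracy for a pre-trained vs. a fine-tuned encoder is the experiment above.
```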

  3. What Happens To BERT Embeddings During Fine-tuning?

Abstract : Uses probing classifiers, Representational Similarity Analysis, and model ablations (freezing the first n layers, or fine-tuning on top of the n-th layer's embeddings) to investigate how fine-tuning affects the representations of the BERT model. Claims that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena (the performance of probes changes only at the top layers). Finds that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks: dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Also finds that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
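
A minimal sketch of the "freeze the first n layers" ablation described there, assuming a BERT-style model from `transformers`; the choice of n, the model name, and the number of labels are placeholders:

```python
# Hedged sketch: freeze the embeddings and the first n encoder layers before fine-tuning.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

n_frozen = 8  # placeholder: keep the first 8 encoder layers fixed
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:n_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters with the first {n_frozen} layers frozen: {trainable:,}")
# The remaining layers and the classification head are then fine-tuned as usual, which
# isolates how much of the task can be learned by the top of the network alone.
```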

  4. On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Abstract : Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature identified two potential reasons for the observed instability: catastrophic forgetting and the small size of the fine-tuning datasets. Shows that both hypotheses fail to explain the fine-tuning instability. Analyzes BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and shows that the observed instability is caused by optimization difficulties (the training loss stays constant during failed runs) that lead to vanishing gradients. Additionally, shows that the remaining variance of the downstream task performance can be attributed to differences in generalization, where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on this analysis, presents a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than previously proposed approaches.
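
A minimal sketch of the kind of diagnostic that exposes such optimization difficulties, assuming `transformers` and `torch`: run one forward/backward pass and inspect per-layer gradient norms (in a failed run the lower layers' gradient norms collapse towards zero). The toy batch and labels are placeholders, not the paper's setup:

```python
# Hedged sketch: inspect per-layer gradient norms after a single forward/backward pass.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["a toy example", "another toy example"], padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])
loss = model(**batch, labels=labels).loss
loss.backward()

for i, layer in enumerate(model.bert.encoder.layer):
    # Overall gradient norm of all parameters in encoder layer i.
    norm = torch.norm(torch.stack([p.grad.norm() for p in layer.parameters()])).item()
    print(f"encoder layer {i:2d} gradient norm: {norm:.4f}")
```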

P.S. : I have left hypothes.is annotations on the above papers, if you want to see them.

Jeevesh8 commented 3 years ago

There was also a very similar attempt before, in Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?, but the results, imho, aren't that interesting (compared to the papers above). These people weren't aware of the other work above at the time they did theirs, though; one could probably combine their work with the previous ones to get interesting results.

Are you still planning to work on this, @kip @kcoost?

rookiemann commented 3 years ago

Hello, I saw the 'help wanted' sign and that newbies are welcome. I'd like to help out however I can; I'm very interested in NLP and have a basic understanding of ML. My experience with ML is limited to a lot of reading and some experimental projects and tutorials here and there. I'm not bad at Python (mostly a copy/paste coder), but I can read it fine and can edit and modify it to fit. Thanks for reading.

goodgravy commented 3 years ago

I think Learning and Evaluating General Linguistic Intelligence answers this question, finding that BERT's accuracy on the original task does degrade when you continue fine-tuning on other question-answering datasets.
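
A toy sketch of the measurement protocol behind that finding: train on task A, continue training on task B, then re-evaluate on task A. Here a small scikit-learn classifier on synthetic data stands in for BERT and the QA datasets, purely to illustrate the protocol rather than reproduce the paper:

```python
# Hedged sketch: a toy sequential-training protocol for measuring forgetting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

Xa, ya = make_classification(n_samples=2000, n_features=20, random_state=0)  # "task A"
Xb, yb = make_classification(n_samples=2000, n_features=20, random_state=1)  # "task B"

model = SGDClassifier(random_state=0)
model.partial_fit(Xa, ya, classes=np.unique(ya))   # "fine-tune" on task A
acc_a_before = model.score(Xa, ya)

for _ in range(20):                                # continue training on task B only
    model.partial_fit(Xb, yb)

acc_a_after = model.score(Xa, ya)                  # re-evaluate on task A
print(f"task-A accuracy: {acc_a_before:.2f} -> {acc_a_after:.2f} after training on task B")
```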

StellaAthena commented 2 years ago

This is the central question of Effect of Scale on Catastrophic Forgetting in Neural Networks, now under review at ICLR. While there is some continued work I could see doing, I'm inclined to close this issue based on this paper. Unless there are any serious objections to the paper?

goodgravy commented 2 years ago

@StellaAthena – @nicholasturner1 talked us through this paper in the reading group and I agree there's been a ton of progress here! I do think that the pretraining dataset-dependent results described in §5.4 would be a useful addition to the results in the ICLR paper. My take on this is something like "forgetting is less catastrophic in larger models, with more pre-training, on more varied data".

AFAIK no paper has said all three of those things in one place, but the two papers cover them all between them.

StellaAthena commented 2 years ago

> @StellaAthena – @nicholasturner1 talked us through this paper in the reading group and I agree there's been a ton of progress here! I do think that the pretraining dataset-dependent results described in §5.4 would be a useful addition to the results in the ICLR paper. My take on this is something like "forgetting is less catastrophic in larger models, with more pre-training, on more varied data".
>
> AFAIK no paper has said all three of those things in one place, but the two papers cover them all between them.

Interesting. Do you think we should close this as settled then? Or leave it open since the combo hasn’t been specifically addressed?

goodgravy commented 2 years ago

@StellaAthena yes, I think we should close this.

It prompts an interesting question about whether online learning really is "solved" now, but answering that is a different issue.