facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Knowledge transfer/distillation using pseudo labels #121

Closed · Xinxinatg closed this issue 3 years ago

Xinxinatg commented 3 years ago

I am trying to transfer the knowledge in the pre-trained model to one with a much smaller number of parameters using pseudo labels (binary labels). After looking into the details of the model, I found that the logits it generates are tied to the variant-prediction task. I'm wondering whether such pseudo labels would be informative enough for general knowledge transfer, or whether someone could suggest a better way to generate pseudo labels for a binary classification task.
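For reference, here is roughly how I am pulling the outputs out of one of the released checkpoints (the checkpoint name is just the one I happened to use):

```python
import torch
import esm

# Load a released pretrained checkpoint and its alphabet.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
_, _, batch_tokens = batch_converter(data)

with torch.no_grad():
    out = model(batch_tokens, repr_layers=[33])

# "logits" come from the masked-language-model head over the amino-acid
# vocabulary (one distribution per position), which is why they map to
# variant prediction rather than to an arbitrary binary task.
mlm_logits = out["logits"]               # [batch, seq_len, vocab_size]

# "representations" are the per-residue embeddings usually fed to
# downstream classifiers.
embeddings = out["representations"][33]  # [batch, seq_len, embed_dim]
```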

Licko0909 commented 3 years ago


I have the same question!

tomsercu commented 3 years ago

I want to make sure I understand the question. Is the goal to distill (a) a binary classifier on top of the pre-trained models — full-protein classification, or a per-amino-acid classification task? Or (b) the full pre-trained model, i.e. distilling the masked language model? Both seem like a good idea, but (b) is more compute-intensive. For (a) you can consider generating pseudo-labels on a larger (unlabeled) dataset, but beware of degradation if it is out of distribution relative to your labeled training set.
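For (a), a training step could look roughly like this — `teacher` (the pre-trained model plus your sequence-level binary head), `student`, and the unlabeled batches are all your own pieces, nothing from this repo:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch_tokens, optimizer, T=2.0):
    # Teacher is frozen; its softened outputs act as soft pseudo-labels,
    # which carry more signal than hard binary labels.
    with torch.no_grad():
        teacher_logits = teacher(batch_tokens)          # [batch, 2]
        soft_labels = F.softmax(teacher_logits / T, dim=-1)

    student_logits = student(batch_tokens)              # [batch, 2]
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_labels,
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```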

We're aware of the community's ask for a smaller (distilled) model, i.e. (b), and will eventually get to this. The default approach there will be MLM distillation of the logits over the amino-acid vocabulary, trained on the full UniRef50 dataset.
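For (b), the core loss would be something along these lines: match the student's per-position distribution over the amino-acid vocabulary to the teacher's at masked positions (sketch only; the student architecture is entirely up to you):

```python
import torch
import torch.nn.functional as F

def mlm_distill_loss(student_logits, teacher_logits, mask, T=1.0):
    """KL divergence between teacher and student MLM distributions,
    averaged over masked positions only.

    student_logits, teacher_logits: [batch, seq_len, vocab_size]
    mask: [batch, seq_len] bool tensor marking the masked tokens.
    """
    s = F.log_softmax(student_logits[mask] / T, dim=-1)
    t = F.softmax(teacher_logits[mask] / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```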

Xinxinatg commented 3 years ago

@tomsercu Thanks for the reply! (a) Yes, I want to transfer the knowledge from the pretrained model (ESM) to a much smaller classifier for each individual task. Right now I am focusing on full-protein classification, but it could potentially be extended to per-amino-acid tasks. (b) MLM distillation would be too much for me at the moment given my limited compute. I am currently testing a model trained via intermediate-layer distillation on a pseudo-labelled dataset of about 500K protein sequences (rough setup sketched below). I will report back if any meaningful results come out of it.
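The intermediate-layer part of my setup is roughly the following: project the student's hidden states to the teacher's width and penalize the distance to the teacher's layer outputs, on top of the loss against the pseudo labels (the projection and the weighting are my own choices, not anything from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerDistillLoss(nn.Module):
    """MSE between a projected student hidden state and a teacher hidden state."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: [batch, seq_len, student_dim]
        # teacher_hidden: [batch, seq_len, teacher_dim]
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)

# Total objective (weights are arbitrary placeholders):
# loss = bce_on_pseudo_labels + 0.1 * layer_distill_loss
```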

tomsercu commented 3 years ago

@Xinxinatg - let me know if you have any remaining questions!