UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

model distillation #507

Open ReySadeghi opened 4 years ago

ReySadeghi commented 4 years ago

Hi, can I use "knowledge distillation" and "dimension reduction" for BERT-large? If so, how many layers should be kept in option 2 for knowledge distillation? And for dimension reduction, what new size do you recommend for BERT-large? Thanks.

nreimers commented 4 years ago

Yes, you can also use them for BERT large.

The number of layers and the dimension depend on what you need: you have a storage vs. performance trade-off (dimension) and a run-time vs. performance trade-off (layers).
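To make the two trade-offs concrete, here is a rough sketch in plain Python. It assumes BERT-large's standard configuration (24 layers, hidden size 1024) and float32 embeddings. Both helper names, `evenly_spaced_layers` and `embedding_storage_bytes`, are hypothetical and not part of sentence-transformers; spreading the kept layers evenly is only one possible heuristic, not a recommendation from the library.

```python
from typing import List


def evenly_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> List[int]:
    """Hypothetical helper: pick layer indices spread evenly over the teacher."""
    if not 1 <= num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    step = num_teacher_layers / num_student_layers
    # Take the layer in the middle of each equally sized block of teacher layers.
    return [int(step * i + step / 2) for i in range(num_student_layers)]


def embedding_storage_bytes(num_sentences: int, dim: int, bytes_per_value: int = 4) -> int:
    """Hypothetical helper: raw storage for dense embeddings (float32 by default)."""
    return num_sentences * dim * bytes_per_value


# Run-time trade-off: fewer layers means faster inference, usually at some quality cost.
layers_to_keep = evenly_spaced_layers(num_teacher_layers=24, num_student_layers=6)
print(layers_to_keep)  # [2, 6, 10, 14, 18, 22]

# Storage trade-off: one million sentences at different embedding sizes.
for dim in (1024, 256, 128):
    gigabytes = embedding_storage_bytes(1_000_000, dim) / 1e9
    print(f"dim={dim}: {gigabytes:.3f} GB")
```

How much quality is lost at a given layer count or dimension has to be measured on your own evaluation data; the numbers above only describe cost, not accuracy.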

ReySadeghi commented 4 years ago

I got this error in model_distillation.py at the line "auto_model = student_model._first_module().model": torch.nn.modules.module.ModuleAttributeError: 'Transformer' object has no attribute 'model'

I loaded a fine-tuned BERT model using SentenceTransformer().

nreimers commented 4 years ago

I think it has to be

auto_model = student_model._first_module().auto_model

ReySadeghi commented 4 years ago

Thanks.

ReySadeghi commented 4 years ago

Regarding model_distillation.py: I fine-tuned my distilled model for 10 epochs and used a "SequentialEvaluator" consisting of "MSEEvaluator" and "BinaryClassificationEvaluator". Which evaluator determines which model is saved as the best one? As I understand it, "BinaryClassificationEvaluator" saves the best model based on cosine average precision, whereas with "MSEEvaluator" the loss decreases every epoch, so the best model would be saved after every epoch. Is that correct?

nreimers commented 4 years ago

You can pass a callable to SequentialEvaluator for the main_score_function parameter: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/SequentialEvaluator.py

By default, the score from the last evaluator is used to determine which model is saved. Setting main_score_function = lambda x: x[0] uses the score from the first evaluator instead.
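The selection logic can be illustrated with a small stand-in written in plain Python. This is not the library's code: `run_sequential` and the two fake evaluators are hypothetical, and the scores are made-up numbers. It only shows how a `main_score_function` that receives the list of per-evaluator scores decides which one counts.

```python
from typing import Callable, List, Sequence


def default_main_score(scores: Sequence[float]) -> float:
    """Stand-in for the default behaviour: use the last evaluator's score."""
    return scores[-1]


def run_sequential(
    evaluators: List[Callable[[], float]],
    main_score_function: Callable[[Sequence[float]], float] = default_main_score,
) -> float:
    """Hypothetical stand-in: run each evaluator in order, then reduce to one score."""
    scores = [evaluate() for evaluate in evaluators]
    return main_score_function(scores)


# Fake evaluators returning made-up scores, in the same order as in the question:
# an MSE-style evaluator first, a binary-classification-style evaluator second.
def fake_mse_evaluator() -> float:
    return 0.42


def fake_binary_evaluator() -> float:
    return 0.87


evaluators = [fake_mse_evaluator, fake_binary_evaluator]

# Default: the last evaluator (binary classification) decides which model is "best".
score_default = run_sequential(evaluators)

# With lambda x: x[0], the first evaluator (MSE-style) decides instead.
score_first = run_sequential(evaluators, main_score_function=lambda x: x[0])

print(score_default, score_first)  # 0.87 0.42
```

With the real SequentialEvaluator, the same lambda would be passed as the main_score_function argument when constructing it, so the order of the evaluators in the list matters.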