abis330 opened 5 months ago
cc @nreimers
I can't say for certain, but my suspicion is that this model is literally just the https://huggingface.co/microsoft/MiniLM-L12-H384-uncased model but with every second layer removed, i.e. no distillation.
Did you arrive at this model by performing "deep self-attention distillation" using "microsoft/MiniLM-L12-H384-uncased" as a teacher assistant (which was itself derived as a student of UniLMv2, as per the paper MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers), or by directly removing every second layer from the already distilled "microsoft/MiniLM-L12-H384-uncased" student model?
It isn't exactly clear to me. Could you please confirm?
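For clarity, here is a minimal sketch (using the Hugging Face transformers API) of what I mean by "removing every second layer" without any further distillation; which six layers are kept (e.g. the odd-indexed ones) is just an assumption for illustration:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Load the 12-layer teacher-assistant model (BERT architecture).
model = AutoModel.from_pretrained("microsoft/MiniLM-L12-H384-uncased")
tokenizer = AutoTokenizer.from_pretrained("microsoft/MiniLM-L12-H384-uncased")

# Assumption: keep every second encoder layer (odd indices), yielding a 6-layer model.
kept = [1, 3, 5, 7, 9, 11]
model.encoder.layer = nn.ModuleList([model.encoder.layer[i] for i in kept])
model.config.num_hidden_layers = len(kept)

print(model.config.num_hidden_layers)  # 6
```

If the 6-layer model was instead produced by a separate distillation run, the resulting weights would of course differ from such a simple pruning.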