Hi! That's a very good question, actually :)
The short version is that it depends on what you want. If you want to preserve optimality guarantees, you probably want to extract another subarchitecture (for the sake of good science!), but you might not see too much difference if you just jump ahead and do KD with the Bort architecture (not the pretrained model, just the architecture). In fact, if you just need a fast LM that works best for your target language, I'd focus my efforts on the pre-training/fine-tuning.
Here's why: when I extracted Bort (the OSE/FPTAS step), I used the English RoBERTa and an English dataset. This means that the error is minimized over an English-based dataset. However, the output model (Bort) is untrained, and the "error" minimization happens only because the algorithm prefers faster-converging architectures. It will probably return the same result (or something very close to it) if you change the dataset. Since it is an approximation algorithm, "very close" will likely mean you get the same answer for large enough approximation parameters. Indeed, when Danny pre-trained it with KD, he also used an English-based dataset. However, again, the fast convergence was due to the OSE/FPTAS step, so beyond some hyperparameter tuning I'd wager you'll be able to find similar speedups/results.
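To make the "KD with just the architecture" idea concrete, here's a rough sketch of what it could look like with Hugging Face transformers: plain logit-matching distillation from a language-specific teacher into an untrained student built with Bort's architectural parameters (4 layers, 8 attention heads, hidden size 1024, FFN size 768). This is not the exact recipe Danny used, and the teacher checkpoint and toy sentences are just placeholders:

```python
# Rough KD sketch -- vanilla logit-matching distillation, not the exact recipe we used.
# The teacher checkpoint name and the toy sentences are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer, BertConfig, BertForMaskedLM

teacher_name = "bert-base-german-cased"  # swap in any language-specific BERT
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForMaskedLM.from_pretrained(teacher_name).eval()

# Untrained student with Bort's architectural parameters
# (4 layers, 8 heads, hidden size 1024, FFN size 768),
# but the teacher's vocabulary so the MLM logits line up.
student_config = BertConfig(
    vocab_size=teacher.config.vocab_size,
    num_hidden_layers=4,
    num_attention_heads=8,
    hidden_size=1024,
    intermediate_size=768,
)
student = BertForMaskedLM(student_config)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
T = 2.0  # softmax temperature

sentences = ["Ein Beispielsatz.", "Noch ein Satz für die Destillation."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

for step in range(10):  # in practice: many steps over a large corpus in your language
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # Soft-target distillation loss: KL between softened distributions.
    # (A real run would mask out padding and usually add an MLM term on masked tokens.)
    loss = T ** 2 * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key detail is that the student starts untrained with Bort's shape but the teacher's vocabulary, so all the signal in your target language comes from the teacher and whatever corpus you feed through it.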
However, since Bort is such a small architecture, it will be very hard to fine-tune; that's why we also have some heavy fine-tuning algorithms. If I were creating Bort for some other language, this is where I'd focus 85% of my efforts.
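For reference, a plain fine-tuning baseline with transformers (not the heavy fine-tuning algorithms mentioned above) on your distilled checkpoint would look roughly like this; the checkpoint path, toy dataset, and hyperparameters are placeholders:

```python
# Plain fine-tuning baseline -- NOT the heavy fine-tuning algorithms.
# The checkpoint path and the tiny in-memory dataset are placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "path/to/your-distilled-bort"  # the student you got out of KD
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny in-memory dataset just to show the plumbing.
raw = Dataset.from_dict({
    "text": ["Das war großartig.", "Das war furchtbar."],
    "label": [1, 0],
})
encoded = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

args = TrainingArguments(
    output_dir="bort-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,  # small models like Bort tend to need careful LR tuning
)

Trainer(model=model, args=args, train_dataset=encoded).train()
```

If this baseline looks unstable or plateaus early, that's the "hard to fine-tune" effect I mentioned, and it's the signal to reach for the heavier fine-tuning machinery.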
Hope this helps!
PS: for KD, there are a few language-specific BERTs available in Hugging Face's transformers library.
Hi! If I want to train Bort on another language, do I need to first pretrain BERT on that language and then extract the sub-model from it? Or can I just train Bort from scratch, without a pretrained BERT?