agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

BERT pretraining approach #100

Closed shu1273 closed 1 year ago

shu1273 commented 1 year ago

@mheinzinger, Hi,

I want to pre-train the ProtBert model from scratch, which basically means training a BERT model. Do you know what the BERT pre-training accuracy (not fine-tuned) was with the 100-0-0 masking approach vs. the 80-10-10 approach? I could not find it anywhere. I understand why the 80-10-10 approach is implemented, but did they run any experiments to figure this out? Please advise.
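For context, here is a minimal PyTorch sketch of the two masking schemes being compared. The function name and parameters are illustrative (loosely following the Hugging Face MLM data collator), not the actual ProtTrans pre-training code, and special tokens are ignored for brevity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, p_mask=0.8, p_random=0.1):
    """BERT-style masking on a batch of token ids.

    With the default 80-10-10 split, of the ~15% selected positions
    80% become [MASK], 10% become a random token and 10% stay unchanged.
    Setting p_mask=1.0, p_random=0.0 gives the 100-0-0 variant.
    """
    labels = input_ids.clone()

    # pick ~15% of positions as prediction targets
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # positions ignored by the MLM loss

    # p_mask of the selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, p_mask)).bool() & selected
    input_ids[masked] = mask_token_id

    # p_random of the selected positions -> random token
    if p_random > 0.0:
        frac = p_random / (1.0 - p_mask)  # fraction of the *remaining* selected positions
        randomized = torch.bernoulli(torch.full(labels.shape, frac)).bool() & selected & ~masked
        input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # any other selected position is left as-is (the "10% unchanged" part)
    return input_ids, labels
```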

mheinzinger commented 1 year ago

Hi, sorry, we never investigated any differences in the masking scheme. We simply went with the parameters recommended in the original BERT implementation (which is 80-10-10 if I'm not mistaken, but I would need to double-check). That being said, I think the original scheme makes a lot of sense for us, because randomly replacing some amino acids in the input with other amino acids forces the model not to rely too much on what is given in the input, i.e., it teaches the model that inputs can be noisy. I think this is important, e.g., for variant effect prediction, where you do not mask one token at a time but simply feed in non-corrupted protein sequences and extract log-odds from the reconstruction probabilities. If the model were trained only on masked tokens (I think this is what you refer to with 100-0-0), it would probably just learn to copy input to output in such cases. Of course, it would be better to confirm this assumption, but given the costly pre-training we simply stuck to the default parameters.
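A rough sketch of the "feed the non-corrupted sequence and read off log-odds" idea mentioned above, assuming the public Rostlab/prot_bert checkpoint on Hugging Face; the sequence, position, and single-substitution scoring convention are purely illustrative, not a recommended pipeline:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model_name = "Rostlab/prot_bert"  # public ProtBert checkpoint
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy wild-type sequence
pos, wt, mut = 7, "A", "G"                     # 1-based position, wild type, variant

# ProtBert expects amino acids separated by spaces
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # (sequence length + special tokens, vocab size)

# [CLS] sits at index 0, so the 1-based residue position maps directly to the token index
log_probs = torch.log_softmax(logits[pos], dim=-1)
wt_id = tokenizer.convert_tokens_to_ids(wt)
mut_id = tokenizer.convert_tokens_to_ids(mut)

# log-odds of the variant relative to wild type at that position
score = (log_probs[mut_id] - log_probs[wt_id]).item()
print(f"{wt}{pos}{mut}: {score:.3f}")
```

A model trained with some unchanged/randomly replaced inputs (80-10-10) still has to produce meaningful reconstruction probabilities here, whereas a model trained purely on [MASK]ed positions (100-0-0) could get away with copying the visible input, which is the concern raised in the comment above.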