agemagician / ProtTrans

ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.

BERT pretraining approach #100

Closed shu1273 closed 1 year ago

shu1273 commented 1 year ago

@mheinzinger, Hi,

I want to pre-train the ProtBert model from scratch, which basically means training a BERT model. Do you know what the BERT pre-training accuracy (not fine-tuned) was with the 100-0-0 masking approach vs. the 80-10-10 approach? I could not find it anywhere. I understand why the 80-10-10 approach is implemented, but did they run any experiments to figure this out? Please advise.
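For context, here is a minimal PyTorch sketch of the two masking schemes being compared. The function name and parameters are illustrative (loosely following the Hugging Face MLM data collator), not the actual ProtTrans pre-training code, and special tokens are ignored for brevity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, p_mask=0.8, p_random=0.1):
    """BERT-style masking on a batch of token ids.

    With the default 80-10-10 split, of the ~15% selected positions
    80% become [MASK], 10% become a random token and 10% stay unchanged.
    Setting p_mask=1.0, p_random=0.0 gives the 100-0-0 variant.
    """
    labels = input_ids.clone()

    # pick ~15% of positions as prediction targets
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # positions ignored by the MLM loss

    # p_mask of the selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, p_mask)).bool() & selected
    input_ids[masked] = mask_token_id

    # p_random of the selected positions -> random token
    if p_random > 0.0:
        frac = p_random / (1.0 - p_mask)  # fraction of the *remaining* selected positions
        randomized = torch.bernoulli(torch.full(labels.shape, frac)).bool() & selected & ~masked
        input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # any other selected position is left as-is (the "10% unchanged" part)
    return input_ids, labels
```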

mheinzinger commented 1 year ago

Hi, sorry, we never investigated any differences in the masking scheme. We simply went with the parameters recommended in the original BERT implementation (which is 80-10-10 if I'm not mistaken, but I would need to double-check). That being said, I think the original scheme makes a lot of sense for us, because randomly replacing some amino acids in the input with other amino acids forces the model not to rely too much on what is given in the input, i.e., it teaches the model that inputs can be noisy. I think this is important, e.g., for variant effect prediction, where you do not mask one token at a time but simply feed in non-corrupted protein sequences and extract log-odds from the reconstruction probabilities. If the model were trained only on masked tokens (I think this is what you refer to with 100-0-0), it would probably just learn to copy input to output in such cases. Of course, it would be better to confirm this assumption, but given the costly pre-training we simply stuck to the default parameters.
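A rough sketch of the "feed the non-corrupted sequence and read off log-odds" idea mentioned above, assuming the public Rostlab/prot_bert checkpoint on Hugging Face; the sequence, position, and single-substitution scoring convention are purely illustrative, not a recommended pipeline:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model_name = "Rostlab/prot_bert"  # public ProtBert checkpoint
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy wild-type sequence
pos, wt, mut = 7, "A", "G"                     # 1-based position, wild type, variant

# ProtBert expects amino acids separated by spaces
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # (sequence length + special tokens, vocab size)

# [CLS] sits at index 0, so the 1-based residue position maps directly to the token index
log_probs = torch.log_softmax(logits[pos], dim=-1)
wt_id = tokenizer.convert_tokens_to_ids(wt)
mut_id = tokenizer.convert_tokens_to_ids(mut)

# log-odds of the variant relative to wild type at that position
score = (log_probs[mut_id] - log_probs[wt_id]).item()
print(f"{wt}{pos}{mut}: {score:.3f}")
```

A model trained with some unchanged/randomly replaced inputs (80-10-10) still has to produce meaningful reconstruction probabilities here, whereas a model trained purely on [MASK]ed positions (100-0-0) could get away with copying the visible input, which is the concern raised in the comment above.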