agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.
Academic Free License v3.0

What was the pretraining split for the ProtT5-UniRef50 model? #151

Open speydril opened 5 months ago

speydril commented 5 months ago

Do you by any chance still have the dataset split (train/val/test) that was used to pretrain ProtT5-UniRef50? I am trying to investigate data leakage for downstream tasks.

mheinzinger commented 5 months ago

Hi, no, unfortunately we do not have the data splits for this anymore, as we considered downstream prediction performance the acid test. Looking back, this was obviously a mistake. To still move forward on your end, you could take a time cutoff of UniRef, i.e., extract all sequences published after ProtT5 training, and redundancy-reduce those newly added sequences against our training set (which will be a pain, sorry, as we also trained on BFD ...).
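For concreteness, here is a minimal sketch (not the authors' pipeline) of that time-cutoff idea: take a UniRef50 release from before ProtT5 training and one from after, keep only the sequences added in between, and drop any that are still similar to the training-era release. The file names, the 30% identity threshold, and the use of MMseqs2 `easy-search` for the redundancy reduction are all assumptions, not something stated in this thread.

```python
"""Sketch: build a leakage-reduced evaluation set via a UniRef50 time cutoff.

Assumes MMseqs2 is installed and Biopython is available (pip install biopython).
Loading full UniRef50 releases needs a lot of memory; a real run would stream
IDs or use on-disk sets.
"""
import subprocess
from pathlib import Path

from Bio import SeqIO

OLD_RELEASE = Path("uniref50_old.fasta")  # hypothetical: release used for training
NEW_RELEASE = Path("uniref50_new.fasta")  # hypothetical: post-training release
CANDIDATES = Path("new_sequences.fasta")
HITS = Path("hits.m8")

# 1. Collect the IDs present in the training-era release.
old_ids = {rec.id for rec in SeqIO.parse(OLD_RELEASE, "fasta")}

# 2. Keep only the records that first appear in the newer release.
new_records = [rec for rec in SeqIO.parse(NEW_RELEASE, "fasta")
               if rec.id not in old_ids]
SeqIO.write(new_records, CANDIDATES, "fasta")

# 3. Search the candidates against the training-era sequences; report only
#    hits above 30% sequence identity (assumed threshold).
subprocess.run(
    ["mmseqs", "easy-search", str(CANDIDATES), str(OLD_RELEASE),
     str(HITS), "tmp", "--min-seq-id", "0.3"],
    check=True,
)

# 4. Drop every candidate that matched the training set (first m8 column
#    is the query ID).
leaky = {line.split("\t")[0] for line in HITS.read_text().splitlines()}
clean = [rec for rec in new_records if rec.id not in leaky]
SeqIO.write(clean, "eval_set.fasta", "fasta")
print(f"{len(clean)} of {len(new_records)} new sequences survive reduction")
```

Note that this only reduces against UniRef50; as the comment above says, the painful part is that ProtT5 was also trained on BFD, so a thorough check would repeat step 3 against BFD as well.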