Open speydril opened 5 months ago
Do you by any chance still have the dataset split (train/val/test set) that was used to pretrain ProtT5 UniRef50? I am trying to investigate data leakage for downstream tasks.
Hi, no, unfortunately we no longer have the data splits for this, as we considered downstream prediction performance the acid test. Looking back, this was obviously a mistake. To still move forward on your end, you could apply a time-based cutoff to UniRef, i.e., extract all sequences published after ProtT5 training, and redundancy-reduce the newly added sequences against our training set (which will be a pain, sorry, as we also trained on BFD ...).
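For anyone trying the time-cutoff route suggested in the reply, here is a rough sketch of the bookkeeping in plain Python. It is not the ProtT5 authors' pipeline, and everything in it is an assumption: the function names (`read_fasta`, `queries_with_hits`, `select_holdout`) are made up for illustration, the cutoff is approximated by diffing identifiers between an older and a newer UniRef release (crude, since cluster identifiers may change between releases), and the hits file is assumed to be tab-separated output from a search tool such as MMseqs2 or BLAST that you ran yourself against the training sequences with your own identity and coverage thresholds. Any query that appears in that file is treated as redundant and dropped.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Identifier = first whitespace-delimited token of the header.
            seq_id, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def queries_with_hits(hit_lines: Iterable[str]) -> Set[str]:
    """Collect query IDs (first column) from tab-separated search hits.

    Assumption: the search was already run with the desired thresholds,
    so any listed query counts as redundant with the training set.
    """
    hits = set()
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            hits.add(fields[0])
    return hits


def select_holdout(
    new_release: Iterable[Tuple[str, str]],
    old_ids: Set[str],
    redundant_ids: Set[str],
) -> List[Tuple[str, str]]:
    """Keep sequences that are new since the cutoff and have no hit."""
    return [
        (seq_id, seq)
        for seq_id, seq in new_release
        if seq_id not in old_ids and seq_id not in redundant_ids
    ]


# Toy data: all identifiers below are invented for the example.
old_ids = {"UniRef50_A"}  # IDs present in the pre-cutoff release
new_fasta = [
    ">UniRef50_A already in the old release",
    "MKT",
    ">UniRef50_B new, but similar to a training sequence",
    "MKV",
    "LLA",
    ">UniRef50_C new and without a hit",
    "GGG",
]
hits = ["UniRef50_B\ttrain_seq_1\t0.9"]  # query, target, further columns

holdout = select_holdout(read_fasta(new_fasta), old_ids, queries_with_hits(hits))
print(holdout)  # [('UniRef50_C', 'GGG')]
```

The expensive part is the search against the training data (UniRef50 plus BFD), not this filtering step, so it probably makes sense to shrink the candidate set with the release diff first and only then search the remainder.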