Closed hailong23-jin closed 3 years ago
Hi medlen, good catch! Sorry for not being verbose enough in our description. Because of some mappings that we perform from the NetSurfP-2.0 training set to PDB, we decided to drop proteins from the NetSurfP-2.0 training set that had become "Obsolete" between their release and our work. This only affects 0.4% of the training samples, but we will try to include this in the update of our paper. The other difference is only a typo. Thanks!
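As a quick sanity check of the 0.4% figure, here is a minimal sketch that uses only the two counts quoted in this thread (10837 sequences in the original set, 10791 in the paper). The variable names are illustrative and not taken from the ProtTrans code.

```python
# Counts quoted in this thread.
original_count = 10837  # sequences in the original NetSurfP-2.0 training set
used_count = 10791      # proteins reported in the paper

dropped = original_count - used_count
dropped_fraction = dropped / original_count

print(f"dropped {dropped} proteins ({dropped_fraction:.2%} of the training set)")
# prints: dropped 46 proteins (0.42% of the training set)
```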
The original NetSurfP-2.0 dataset contains 10837 protein sequences, but only 10791 proteins were used in this paper. Is there any additional filtering operation here?
Besides, there is a small mistake. The paper says there are 10791 proteins for training, but the file Train_HHblits.csv downloaded from https://www.dropbox.com/s/98hovta9qjmmiby/Train_HHblits.csv?dl=1 contains 10792 proteins. I found this link in downloadNetsurfpDataset() in https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb .
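For anyone who wants to reproduce the count, here is a rough sketch of counting the records in a CSV file with Python's standard csv module. The helper `count_records`, the sample data, and its column names are hypothetical. Whether Train_HHblits.csv starts with a header row is an assumption to check against the actual file, since that choice changes the result by one.

```python
import csv
import io


def count_records(csv_text: str, has_header: bool = True) -> int:
    """Count non-empty CSV rows, optionally excluding one header row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return len(rows) - 1 if has_header and rows else len(rows)


# Tiny made-up sample (hypothetical column names), not the real dataset.
sample = "sequence,labels\nMKV,CCC\nGAL,HHH\n"

print(count_records(sample))                    # 2
print(count_records(sample, has_header=False))  # 3
```

To apply this to the real data, read the downloaded Train_HHblits.csv into a string and pass it to `count_records`, choosing `has_header` according to what the first line of the file actually contains.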