agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Why is the number of proteins used for per-residue prediction different from the original NetSurfP-2.0 dataset? #38

Closed hailong23-jin closed 3 years ago

hailong23-jin commented 3 years ago

The original NetSurfP-2.0 dataset contains 10837 protein sequences, but only 10791 proteins were used in this paper. Was there an additional filtering step?

NetSurfP-2.0

A structural dataset consisting of 12 185 crystal structures was obtained from the Protein Data Bank (PDB) [22], culled and selected by the PISCES server with 25% sequence similarity clustering threshold and a resolution of 2.5 Å or better. To avoid overfitting, any sequence that had more than 25% identity to any sequences in the test datasets (see “Evaluation” section for details) was removed, as well as peptide chains with less than 20 residues, leaving 10 837 sequences.

ProtTrans

Per-residue prediction: When predicting properties on the level of single residues, the data set published alongside NetSurfP-2.0 [25] was used for 3- and 8-state secondary structure prediction. The NetSurfP-2.0 dataset was created through PISCES [40] selecting highest resolution protein structures (resolution <=2.5A) from the PDB [41]. The set was redundancy-reduced such that no pair of proteins had 25% pairwise sequence identity (PIDE), leaving 10791 proteins to train.

Besides, there is a small mistake. The paper says there are 10791 proteins for training, but the file Train_HHblits.csv downloaded from https://www.dropbox.com/s/98hovta9qjmmiby/Train_HHblits.csv?dl=1 contains 10792 proteins. I found this link in downloadNetsurfpDataset() in https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb .
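For anyone who wants to repeat the count, here is a minimal sketch of one way to do it. It assumes the CSV has a single header row and one protein per data row, which is an assumption about the file layout rather than something confirmed in this thread. The helper name `count_proteins`, the sample data, and its column names are made up for illustration.

```python
import csv
import io


def count_proteins(csv_text: str, has_header: bool = True) -> int:
    """Count non-empty CSV rows, optionally excluding one header row."""
    # csv.reader yields an empty list for blank lines, so `if row` drops them.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return len(rows) - 1 if has_header and rows else len(rows)


# Tiny made-up sample standing in for Train_HHblits.csv
# (the column names here are hypothetical).
sample = "input,dssp3\nMKV,CCH\nGAL,HHE\n"
print(count_proteins(sample))
```

To run it on the real file, read the downloaded Train_HHblits.csv into a string and pass it to `count_proteins`. Calling it with `has_header=False` shows how a header row alone would shift the count by one.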

mheinzinger commented 3 years ago

Hi medlen, good catch! Sorry for not being verbose enough in our description. Because we map the NetSurfP-2.0 training set to PDB, we decided to drop proteins from the NetSurfP-2.0 training set that had become "Obsolete" between their release and our work. This only affects 0.4% of the training samples, but we will try to include it in the update of our paper. The other difference is only a typo. Thanks!
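As a quick arithmetic check of the 0.4% figure, the sketch below uses only the two counts quoted in this thread. It assumes the entire gap between them comes from the dropped "Obsolete" entries. The variable names are illustrative.

```python
# Counts quoted in this thread.
netsurfp_train = 10837   # sequences in the original NetSurfP-2.0 training set
prottrans_train = 10791  # proteins reported as used for training in the ProtTrans paper

# Assumption: the whole difference is due to the dropped "Obsolete" PDB entries.
dropped = netsurfp_train - prottrans_train
percent_dropped = 100 * dropped / netsurfp_train

print(f"{dropped} proteins dropped ({percent_dropped:.2f}% of the training set)")
```

This gives 46 dropped proteins, about 0.42% of the original set, which agrees with the rounded 0.4% above.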