Why is the number of proteins used for per-residue prediction different from the original NetSurfP-2.0 dataset?

The original NetSurfP-2.0 dataset contains 10837 protein sequences, but only 10791 proteins were used in this paper. Was an additional filtering step applied here?

NetSurfP-2.0

A structural dataset consisting of 12 185 crystal structures was obtained from the Protein Data Bank (PDB) [22], culled and selected by the PISCES server with 25% sequence similarity clustering threshold and a resolution of 2.5 Å or better. To avoid over fitting, any sequence that had more than 25% identity to any sequences in the test datasets (see “Evaluation” section for details) was removed, as well as peptide chains with less than 20 residues, leaving 10 837 sequences.

ProtTrans

Per-residue prediction: When predicting properties on the level of single residues, the data set published alongside NetSurfP-2.0 [25] was used for 3- and 8-state secondary structure prediction. The NetSurfP-2.0 dataset was created through PISCES [40] selecting highest resolution protein structures (resolution <=2.5A) from the PDB [41]. The set was redundancy-reduced such that no pair of proteins had 25% pairwise sequence identity (PIDE), leaving 10791 proteins to train.

Also, there is a small discrepancy. The paper says 10791 proteins were used for training, but the file Train_HHblits.csv downloaded from https://www.dropbox.com/s/98hovta9qjmmiby/Train_HHblits.csv?dl=1 contains 10792 proteins. I found this link in downloadNetsurfpDataset() in https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTune-SS3.ipynb .
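
For anyone who wants to double-check the count, here is a minimal sketch of how the records in such a CSV could be counted. It assumes one protein per CSV record and reports the count both with and without treating the first row as a header, since that is an easy source of an off-by-one difference. The helper name `count_csv_records` and the column names in the in-memory sample are made up for illustration; I have not assumed anything about the real layout of Train_HHblits.csv.

```python
import csv
import io


def count_csv_records(handle, has_header=True):
    """Count data records in a CSV stream, ignoring completely blank rows.

    csv.reader is used instead of counting raw lines so that quoted
    fields containing newlines are still counted as a single record.
    """
    rows = [row for row in csv.reader(handle) if row]
    if has_header and rows:
        rows = rows[1:]  # drop the assumed header row
    return len(rows)


# Tiny in-memory stand-in for the real file (hypothetical columns and values).
sample = "sequence,label\nM K V,C C E\nA L,H H\n"

n_with_header = count_csv_records(io.StringIO(sample), has_header=True)
n_without_header = count_csv_records(io.StringIO(sample), has_header=False)
print(n_with_header, n_without_header)

# On the downloaded file this would be, for example:
# with open("Train_HHblits.csv", newline="") as f:
#     print(count_csv_records(f))
```

If the two counts on the real file are 10791 and 10792, the difference would simply be the header row; if not, the discrepancy lies elsewhere.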
