aqlaboratory / rgn

Recurrent Geometric Networks for end-to-end differentiable learning of protein structure
MIT License
326 stars, 87 forks

Dataset and config used for the paper #9

Closed ecvgit closed 5 years ago

ecvgit commented 5 years ago

A couple of questions about the results in the paper.

  1. Which parameters were used to get the results in the paper? Are they the ones in https://github.com/aqlaboratory/rgn/blob/master/configurations/CASP11.config? If so, were all sequences longer than 700 residues removed from the test set (maxSeqLength)?

  2. Is it possible to find out which sequences were used for the FM category and which for the TBM category? Can this information be obtained from the ProteinNet dataset?

alquraishi commented 5 years ago

  1. Yes, the parameters for all the models are in the directory you mention. The validation set was pre-selected to exclude all sequences with more than 700 residues. For the test sets, I believe all proteins in CASP12 and earlier were shorter than 700 residues, so no filtering was needed. That said, you can raise the limit in the config file above 700 residues at prediction time to make predictions on longer proteins, although performance may suffer because the model wasn't trained on proteins of that length.

  2. Yes, this information can be recovered from the ProteinNet text-based records, since the entry IDs contain the CASP category. See here for more info.
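
For anyone who wants to do this programmatically, here is a minimal sketch of both points: grouping test entries by CASP category and checking sequence lengths against a limit. It assumes that each text record has an `[ID]` line followed by the entry ID and a `[PRIMARY]` line followed by the sequence, and that test-set IDs look like `<category>#<CASP ID>`. Both are assumptions about the record layout, so check them against the ProteinNet documentation. The sample records, the helper names, and the `max_len` parameter are made up for illustration and are not part of RGN or ProteinNet.

```python
from collections import defaultdict


def parse_records(text):
    """Collect {'id': ..., 'primary': ...} dicts from ProteinNet-style text.

    Assumes (not verified here) that '[ID]' and '[PRIMARY]' header lines
    are each followed by a single line holding the value.
    """
    records = []
    current = {}
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if line == "[ID]":
            if current:
                records.append(current)
            current = {"id": next(lines, "").strip()}
        elif line == "[PRIMARY]":
            current["primary"] = next(lines, "").strip()
    if current:
        records.append(current)
    return records


def group_by_category(records):
    """Group CASP IDs by the category prefix before '#' (assumed ID format)."""
    groups = defaultdict(list)
    for rec in records:
        category, sep, casp_id = rec["id"].partition("#")
        if not sep:
            # No '#' in the ID, so the category cannot be read from it.
            category, casp_id = "unknown", rec["id"]
        groups[category].append(casp_id)
    return dict(groups)


def longer_than(records, max_len):
    """Return the IDs of records whose sequence exceeds max_len residues."""
    return [r["id"] for r in records if len(r.get("primary", "")) > max_len]


# Hypothetical records, only to show the expected shape of the input.
SAMPLE = """[ID]
FM#T0001
[PRIMARY]
MKV

[ID]
TBM#T0002
[PRIMARY]
ACDEFG
"""

records = parse_records(SAMPLE)
groups = group_by_category(records)
too_long = longer_than(records, max_len=5)
```

With a real test-set file, you would read its contents into a string, pass it to `parse_records`, and call `longer_than(records, 700)` to confirm that no test protein exceeds the training limit.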