bjornwallner/ProQ_scripts

PROQ SCRIPTS
Contact: bjornw@ifm.liu.se
git: https://github.com/bjornwallner/ProQ_scripts.git

ProQ is a Model Quality Assessment Program that predicts the quality of individual models. To do this, ProQ uses information calculated from the 3D coordinates of the model as well as information predicted from the sequence. The sequence-derived information is the same for all models of the same sequence, so the sequence-specific features need to be calculated once, before any models are scored. The ProQ_scripts folder contains the scripts and programs needed to generate the sequence-specific input features for the ProQ2 and ProQM scoring functions in Rosetta.

INSTALLATION of ProQ2/ProQM scoring functions

1) git clone https://github.com/bjornwallner/ProQ_scripts
2) Download the latest weekly release of Rosetta from https://www.rosettacommons.org/software/. You need a license, but it is free for academia.
3) Compile Rosetta according to the instructions.

INSTALLATION of ProQ_scripts to calculate sequence-specific features

To install the programs that calculate all sequence-specific features, do the following steps:

1) Set the $DB variable in bin/run_all_external.pl to a formatted sequence database, e.g. uniref90 (download the fasta sequences and run formatdb from the blast package).
2) Run ./configure.pl

TEST

To check that everything is set up, run

    ./run_test.sh

This should create all the files needed without any error messages.

RUNNING

Once set up, the main wrapper program 'bin/run_all_external.pl' runs everything, either using a fasta file or extracting the sequence from a pdb file.

    bash$ bin/run_all_external.pl
    Usage: run_all_external.pl -pdb [pdbfile]
                               -fasta [fastafile]
                               -membrane 1 (if membrane protein)
                               -overwrite 1 (if overwrite)

For water soluble proteins run:

    bin/run_all_external.pl -pdb [pdbfile]
OR
    bin/run_all_external.pl -fasta [fastafile]

For membrane proteins run:

    bin/run_all_external.pl -pdb [pdbfile] -membrane 1
OR
    bin/run_all_external.pl -fasta [fastafile] -membrane 1

To use the sequence-specific files in Rosetta you need to use the flag -ProQ:basename <basename>, where <basename> is either the input pdbfile or the input fastafile used to generate the sequence-specific features. All features are named basename.*, e.g. basename.ss2 for the secondary structure prediction or basename.mtx for the psiblast profile.
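Before pointing Rosetta at a basename, it can help to confirm that the feature files are actually there. The following is a minimal sketch, not part of ProQ_scripts: the function name check_features is hypothetical, and it only checks the two suffixes named in this README (.ss2 and .mtx), so extend the list to match what your installation produces.

```shell
# Hypothetical helper, not shipped with ProQ_scripts.
# Reports which sequence feature files exist (and are non-empty) for a basename.
# Only the .ss2 and .mtx suffixes are named in this README; add others as needed.
check_features() {
    base="$1"
    missing=0
    for suffix in ss2 mtx; do
        if [ -s "${base}.${suffix}" ]; then
            echo "found:   ${base}.${suffix}"
        else
            echo "missing: ${base}.${suffix}"
            missing=1
        fi
    done
    # Non-zero exit status if at least one file is missing.
    return "$missing"
}
```

For example, `check_features 1abc.pdb` lists the status of 1abc.pdb.ss2 and 1abc.pdb.mtx and returns a non-zero status if either is missing.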

Currently there is no functionality in Rosetta to account for models that have missing residues. These have to be handled outside Rosetta, by first creating the sequence features for the full-length sequence and then using the script 'copy_features_from_master.pl'. For the script to work you need to install the global alignment program 'needle' from the EMBOSS package (http://emboss.sourceforge.net/, or sudo apt-get install emboss).

The following example creates all sequence-specific features for the full.seq sequence, and then copies the relevant features to model1-5.pdb by aligning the sequence of each model to full.seq.

    ./run_all_external.pl -fasta full.seq
    ./copy_features_from_master.pl model1.pdb full.seq
    ./copy_features_from_master.pl model2.pdb full.seq
    ./copy_features_from_master.pl model3.pdb full.seq
    ./copy_features_from_master.pl model4.pdb full.seq
    ./copy_features_from_master.pl model5.pdb full.seq
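With more than a handful of models, the repeated copy step can be wrapped in a loop. This is a sketch under assumptions: the function name copy_features_for_models and the COPY_SCRIPT variable are hypothetical conveniences, and the script is called with the same argument order as in the example (model first, master sequence second).

```shell
# Hypothetical wrapper, not shipped with ProQ_scripts.
# COPY_SCRIPT lets the script location be overridden; the default matches
# the path used in the example.
COPY_SCRIPT="${COPY_SCRIPT:-./copy_features_from_master.pl}"

# Usage: copy_features_for_models MASTER_SEQ MODEL.pdb [MODEL.pdb ...]
copy_features_for_models() {
    master="$1"
    shift
    for model in "$@"; do
        # Same argument order as the example: model first, then master.
        "$COPY_SCRIPT" "$model" "$master" || {
            echo "copy failed for $model" >&2
            return 1
        }
    done
}
```

After creating the features for full.seq, `copy_features_for_models full.seq model*.pdb` would then process every matching model and stop at the first failure.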

RUNNING ProQ or ProQM on a pdb with the name 1abc.pdb

1) Create the sequence-specific features as described above. You do this once for a given sequence, i.e.

    bin/run_all_external.pl -pdb 1abc.pdb

will create files with the basename 1abc.pdb. For membrane proteins:

    bin/run_all_external.pl -pdb 1abc.pdb -membrane 1

2a) To score all .pdb files run:

    score.linuxgccrelease -database -in:file:fullatom -ProQ:basename 1abc.pdb -in:file:s *.pdb -out:file:scorefile ProQ.sc -score:weights ProQ2

For membrane proteins:

    score.linuxgccrelease -database -in:file:fullatom -ProQ:basename 1abc.pdb -in:file:s *.pdb -out:file:scorefile ProQM.sc -score:weights ProQM -ProQ:membrane

This will produce a scorefile with a global score corresponding to the sum of the local scores. To get the usual ProQ/ProQM score, divide by the length, or run the scoring with -ProQ:normalize to get normalized scores directly.
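The division by length can be done with a one-line helper. This is an illustrative sketch: the function name normalize_score is hypothetical, it assumes "length" means the number of residues in the sequence, and the numbers in the usage note are made up for the arithmetic only.

```shell
# Hypothetical helper, not shipped with ProQ_scripts.
# Usage: normalize_score SUMMED_SCORE LENGTH
# Divides a summed score by the length (assumed here to be the residue count).
normalize_score() {
    awk -v sum="$1" -v len="$2" 'BEGIN {
        if (len <= 0) {
            print "length must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.3f\n", sum / len
    }'
}
```

For instance, a summed score of 75 over 100 residues gives `normalize_score 75 100`, which prints 0.750.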

To get the local scores, run the scoring with -ProQ:output_local_prediction; this will generate local quality estimates with the name ProQM.

2b) Side-chain sampling before scoring

USING PROQ2 (NON-MEMBRANE)

Generate 10 different side-chain conformations for each pdb:

    relax.linuxgccrelease -database -in:file:fullatom -out:file:silent_struct_type binary -nstruct 10 -relax_script /resampling/repack.script -in:file:s *.pdb -out:file:silent ProQ2.repacked.silent

Score each of these with ProQ2:

    score.linuxgccrelease -database -in:file:fullatom -score:weights ProQ2 -in:file:silent ProQ2.repacked.silent -out:file:scorefile ProQ2.repacked.silent.score

Select the model with the highest ProQ2 score.

USING PROQM (MEMBRANE)

Generate 10 different side-chain conformations for each pdb:

    relax.linuxgccrelease -database -in:file:fullatom -out:file:silent_struct_type binary -relax:membrane -membrane:Membed_init -score:weights membrane_highres.wts -in:file:spanfile $basename.span -nstruct 10 -relax_script /resampling/repack.script -in:file:s *.pdb -out:file:silent ProQM.repacked.silent

Score each of these with ProQM:

    score.linuxgccrelease -database -in:file:fullatom -ProQ:membrane -score:weights ProQM -in:file:silent ProQM.repacked.silent -out:file:scorefile ProQM.repacked.silent.score

Select the model with the highest ProQM score.
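Selecting the highest-scoring model from a scorefile can be scripted. The sketch below rests on assumptions about the scorefile layout that you should check against your own output: that score lines start with "SCORE:", that the first such line is a header of column names, and that the model name is in a column called "description". The function name best_model is hypothetical, and the score column name is passed as an argument because this README does not state it.

```shell
# Hypothetical helper, not shipped with ProQ_scripts or Rosetta.
# Usage: best_model SCOREFILE COLUMN
# Prints "description value" for the row with the highest value in COLUMN.
# Assumes lines start with "SCORE:" and the first such line is the header.
best_model() {
    awk -v col="$2" '
        $1 != "SCORE:" { next }              # skip any non-score lines
        !idx {                               # first SCORE: line is the header
            for (i = 2; i <= NF; i++) {
                if ($i == col) idx = i
                if ($i == "description") desc = i
            }
            if (!idx || !desc) {
                print "column not found" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        $desc == "description" { next }      # skip a repeated header line
        !seen || $idx + 0 > best {           # keep the highest value so far
            best = $idx + 0
            raw = $idx
            name = $desc
            seen = 1
        }
        END {
            if (bad || !seen) exit 1
            print name, raw
        }
    ' "$1"
}
```

Under those assumptions, and if the score column in your file is named ProQ2, `best_model ProQ2.repacked.silent.score ProQ2` would print the name and score of the best repacked model; use the ProQM scorefile and column name for membrane proteins.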

2c) ProQM-resample

Follow the instructions above to install ProQ_scripts.

The command

    resampling/resampling.pl <any pdb in the folder (this is only used to get the sequence)>

will do 1+2b (calculate sequence-specific features + resampling + scoring) on all .pdb files in the folder from which it is executed. All .pdb files need to be for the same sequence.
