cusbg / prankweb

Web application for protein-ligand binding sites analysis and visualization
https://prankweb.cz
Apache License 2.0

Conservation description #98

Closed davidhoksza closed 2 years ago

davidhoksza commented 2 years ago

Change the homology section in the Help page to "Conservation" and set the text to the following:

Besides selecting which protein to analyze, one can also specify whether evolutionary conservation should be included in the prediction model by checking the "Use conservation" checkbox. Note that calculating the conservation score can increase the analysis time.

The new conservation pipeline operates as follows. First, polypeptide chain sequences are extracted from the input file using P2Rank. The phmmer tool from the HMMER software package is then used to identify and align similar sequences for each query sequence; UniRef50 Release 2021_03 is used as the single target sequence database. Up to 1,000 sequences are then randomly selected from each resulting MSA to form a sample MSA, and weights are assigned to the individual sequences in each sample MSA using the Gerstein/Sonnhammer/Chothia algorithm implemented in the esl-weight miniapp included with the HMMER software. Finally, per-column information content (i.e., the conservation score) and gap character frequency values are calculated using the esl-alistat miniapp, taking the individual sequence weights into account; positions containing the gap character in more than 50% of sequences are masked so that they appear to have no conservation at all.
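The last two steps (sampling and weighted per-column scoring) can be sketched in Python as follows. This is only an illustration, not the code used by esl-weight or esl-alistat. It assumes that information content is log_2(20) minus the Shannon entropy of the weighted residue frequencies in a column, that gap frequency is also weighted, and that sequence weights are already available. The function names `sample_msa` and `column_scores` are hypothetical.

```python
import math
import random
from collections import defaultdict

GAP = "-"
MAX_INFO = math.log2(20)  # upper bound for a 20-letter amino acid alphabet


def sample_msa(sequences, limit=1000, seed=0):
    """Randomly keep at most `limit` aligned sequences (hypothetical helper)."""
    sequences = list(sequences)
    if len(sequences) <= limit:
        return sequences
    return random.Random(seed).sample(sequences, limit)


def column_scores(sequences, weights, gap_threshold=0.5):
    """Return one (information_content, gap_frequency) pair per column.

    Assumption: information content = log2(20) - entropy of the weighted
    residue frequencies; columns whose weighted gap frequency exceeds
    `gap_threshold` are masked to 0.0.
    """
    total_weight = sum(weights)
    scores = []
    for col in range(len(sequences[0])):
        counts = defaultdict(float)
        for seq, weight in zip(sequences, weights):
            counts[seq[col]] += weight
        gap_freq = counts.pop(GAP, 0.0) / total_weight
        residue_weight = sum(counts.values())
        if gap_freq > gap_threshold or residue_weight == 0:
            scores.append((0.0, gap_freq))  # masked: no conservation
            continue
        entropy = -sum(
            (c / residue_weight) * math.log2(c / residue_weight)
            for c in counts.values()
        )
        scores.append((MAX_INFO - entropy, gap_freq))
    return scores


# Toy alignment with uniform weights (real weights would come from esl-weight).
toy_scores = column_scores(["AC-", "AD-", "AEG"], [1.0, 1.0, 1.0])
```

In the toy alignment, the first column is fully conserved, the second holds three different residues, and the third is masked because two of its three sequences have a gap.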

The conservation score has the same range as the per-residue information content: from 0 to log_2(20) ≈ 4.32, with higher values corresponding to higher conservation.
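The two ends of this range can be checked with a short calculation. The sketch below assumes the same definition as above (log_2(20) minus the Shannon entropy of the residue frequencies at a position); `information_content` is a hypothetical helper, not part of HMMER.

```python
import math


def information_content(freqs, alphabet_size=20):
    """log2(alphabet_size) minus the Shannon entropy of `freqs` (assumed definition)."""
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(alphabet_size) - entropy


# A position occupied by a single residue type reaches the upper bound.
fully_conserved = information_content([1.0])

# A position where all 20 residues are equally frequent sits at the lower bound.
uniform = information_content([1 / 20] * 20)
```

Here `fully_conserved` equals log_2(20), and `uniform` is zero up to floating-point rounding.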