aewebb80 / VESPA

VESPA: Very large-scale Evolutionary and Selective Pressure Analyses
GNU General Public License v3.0

Documentation Issues #10

Open AliSTaylor opened 6 years ago

Phase 1: Data Preparation

  1. State that the TTN protein should be removed unless it is explicitly part of the analysis: because it is so large, it shows high similarity to virtually any protein. Could a script be included to remove TTN based on headers?
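As a rough idea of what such a script could look like, here is a minimal sketch. It is not part of VESPA: `read_fasta` and `drop_by_header` are hypothetical helper names, and it assumes the gene symbol `TTN` appears in the sequence header as a separate token. If the headers only carry Ensembl IDs, the pattern would have to be the relevant ID instead.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_by_header(records, symbol: str = "TTN") -> List[Tuple[str, str]]:
    """Keep records whose header does not contain the symbol as a whole token.

    Assumption: the symbol is delimited by non-alphanumeric characters
    (e.g. '_' or '|'), so 'Human_TTN|T1' matches but 'ATTN1' does not.
    """
    regex = re.compile(r"(?<![A-Za-z0-9])" + re.escape(symbol) + r"(?![A-Za-z0-9])")
    return [(h, s) for h, s in records if not regex.search(h)]
```

The filtered records can then be written back out before the clean step, so TTN never reaches the homology search.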

Clean Functions

  1. The Ensembl-directed cleaning step is given as 'clean_ensembl' in the manual and as 'ensemble_clean' in the rundown of commands shown when calling VESPA. The command is actually 'ensembl_clean'.

  2. If you are using pre-created gene families or smaller sequences, the clean function may remove every sequence from some files (e.g. small gene families with poor-quality sequences). This leaves empty files, which cause problems when running 'ensembl_clean' or 'create_database': both report an error about incorrectly formatted sequences. It may be useful to include a check for empty files.
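A check along these lines could be run between cleaning and the next step. This is only a sketch, not VESPA code: `has_sequence` and `find_empty_fasta` are hypothetical names, and a file is treated as "empty" if it has no header followed by at least one line of sequence.

```python
from pathlib import Path
from typing import List


def has_sequence(path: Path) -> bool:
    """Return True if the file holds at least one header followed by sequence."""
    seen_header = False
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                seen_header = True
            elif line and seen_header:
                return True
    return False


def find_empty_fasta(directory: str) -> List[str]:
    """List the names of files in the directory that contain no usable sequence."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and not has_sequence(p)
    )
```

The returned names can be moved aside or reported before 'ensembl_clean' or 'create_database' is called on the directory.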

File Formatting

  1. The sequence headers should be edited post-cleaning for further analysis, such as aligning, comparing alignments, mapping back to nucleotide alignments and running CodeML. For Ensembl genes, the syntax of >Species_EnsemblGeneID|TranscriptID has worked.

    • You will need the common name when building trees
    • The transcript ID is needed during the nucleotide mapping phase, in case of alternative transcripts.
    • CodeML will not accept headers over 30 characters, so it is best to simplify headers and then edit for CodeML.
  2. A nucleotide database should be created after cleaning and before translating. You will need a nucleotide database for the alignment mapping step.
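The header convention and the 30-character limit mentioned above can be checked up front. The sketch below is an illustration only: `check_header` is a hypothetical helper, and the regular expression is one reading of the `Species_EnsemblGeneID|TranscriptID` syntax (species has no underscore, and neither ID contains '|' or whitespace).

```python
import re
from typing import List

# Assumed reading of the convention: Species_GeneID|TranscriptID
HEADER_RE = re.compile(
    r"^(?P<species>[^_|\s]+)_(?P<gene>[^|\s]+)\|(?P<transcript>[^|\s]+)$"
)


def check_header(header: str, max_len: int = 30) -> List[str]:
    """Return the problems found in one header (without the leading '>')."""
    problems = []
    if HEADER_RE.match(header) is None:
        problems.append("format")
    if len(header) > max_len:
        problems.append("length")  # too long for CodeML
    return problems
```

Note that a full Ensembl gene ID plus transcript ID already exceeds 30 characters, which is why the headers need a second, shortened form before CodeML.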

Database Functions

  1. Make clear that this database is not BLAST-formatted and needs further steps outside VESPA before it will work with BLAST, e.g.:

         mkdir Blastdb
         cp database.fas Blastdb
         cd Blastdb
         makeblastdb -in database.fas -dbtype prot

Phase 2: Homology Searching

  1. Explain why you might choose a particular homology searching option. Reciprocal offers more support, since each gene in a pair has to 'find' the other. What would the Similarity or Best_Reciprocal functions be suited for?

  2. It is not stated, but for CodeML you need to identify the single-gene orthologous families after creating the gene families. A function that searches for gene families with only one member from each species would therefore be useful.

  3. In addition, only gene families with 7+ members are informative for CodeML, so the single-gene orthologous families have to be filtered by size. Currently, I count the sequences in each family, then move all families with 7+ members to a new directory 'CodeML_Informative' and place those with 6 or fewer members in a 'CodeML_Uninformative' directory.
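The two filters described in points 2 and 3 could be combined in something like the following. This is a sketch rather than VESPA functionality: the function names are hypothetical, and the species is assumed to be the text before the first underscore in each header, as in the convention suggested under File Formatting.

```python
from collections import Counter
from typing import Iterable, List


def species_of(header: str) -> str:
    """Species label: text before the first underscore (assumed convention)."""
    return header.lstrip(">").split("_", 1)[0]


def is_single_copy(headers: Iterable[str]) -> bool:
    """True if no species contributes more than one sequence to the family."""
    counts = Counter(species_of(h) for h in headers)
    return bool(counts) and all(n == 1 for n in counts.values())


def classify_family(headers: List[str], min_members: int = 7) -> str:
    """Return the directory name a gene family would be sorted into."""
    if not is_single_copy(headers):
        return "not_single_copy"
    if len(headers) >= min_members:
        return "CodeML_Informative"
    return "CodeML_Uninformative"
```

A wrapper would read the headers of each family file and move the file according to the returned label.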

Phase 3: Alignment Assessment and Phylogeny Reconstruction

Alignment Comparison

  1. Sequence headers must match exactly. In addition, alignments must have the same number of amino acids as the alignment you are comparing to. Some alignment programs remove a terminal stop codon, whereas others do not. VESPA removes the terminal stop codon later in the pipeline, but having different numbers of amino acids will result in an error.

  2. MAFFT and MUSCLE alignments compared well when the sequence headers were identical and followed the convention detailed above, with MAFFT as the input and MUSCLE as the comparison.
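Both requirements in point 1 can be tested before the comparison is run. The sketch below is illustrative only: `compare_alignments` is a hypothetical helper that takes two alignments as dictionaries mapping header to aligned sequence, with '-' as the gap character and '*' as a terminal stop.

```python
from typing import Dict, List, Tuple


def residues(aligned: str) -> str:
    """Strip gap characters, leaving only the residues."""
    return aligned.replace("-", "")


def compare_alignments(
    first: Dict[str, str], second: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Report headers and residue counts that differ between two alignments."""
    issues = []
    # Headers present in only one of the two alignments
    for header in sorted(set(first) ^ set(second)):
        issues.append((header, "header missing from one alignment"))
    # Shared headers whose ungapped lengths differ
    for header in sorted(set(first) & set(second)):
        a, b = residues(first[header]), residues(second[header])
        if len(a) == len(b):
            continue
        if a.rstrip("*") == b.rstrip("*"):
            issues.append((header, "terminal stop codon differs"))
        else:
            issues.append((header, "residue count differs"))
    return issues
```

An empty list means the headers match and every sequence has the same number of residues in both alignments.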

ProtTest Setup

  1. ProtTest itself is poorly documented, which makes it harder to compile than some tools. In addition, it requires internal builds such as PhyML. Perhaps a ProtTest setup guide would be useful.

  2. The prottest_setup output includes an ASCII file with the commands for 'prottest3'. This file has to be edited on a user-by-user basis, depending on how prottest is called on their system (especially if programs are in a group or central directory). In addition, guidance on the best way to run this file (i.e. on the command line, or by calling it from a submission script) would be useful.
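The per-user edit could be scripted. The following is a sketch under a stated assumption: each command line in the generated file starts with the literal token `prottest3`. `rewrite_launcher` is a hypothetical helper, and the launcher string passed to it is whatever works on the local system.

```python
from typing import Iterable, List


def rewrite_launcher(
    lines: Iterable[str], launcher: str, old: str = "prottest3"
) -> List[str]:
    """Replace a leading 'prottest3' token with a site-specific launcher.

    Lines that do not start with the old token are returned unchanged
    (minus the trailing newline).
    """
    rewritten = []
    for line in lines:
        text = line.rstrip("\n")
        parts = text.split(None, 1)
        if parts and parts[0] == old:
            rest = parts[1] if len(parts) > 1 else ""
            text = f"{launcher} {rest}".rstrip()
        rewritten.append(text)
    return rewritten
```

The rewritten lines can then be saved as a shell script or pasted into a submission script, depending on how jobs are run locally.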

MrBayes Setup

  1. The Python dictionary used by VESPA to interpret models for phylogenetic reconstruction doesn't include all models supported by MrBayes 3, such as MtRev, CpRev and MtMam. In addition, it does not support model variants such as [model]+G, +F or +I.
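To illustrate what a fuller mapping might look like, here is a sketch. It is not VESPA's actual dictionary: the names `MRBAYES_AAMODEL` and `mrbayes_commands` are hypothetical, and the model-name spellings on the left are assumed to follow ProtTest-style output. The '+F' suffix is only reported back to the caller; no MrBayes setting is emitted for it here.

```python
from typing import List, Tuple

# Model name (upper-cased) -> MrBayes fixed amino acid model name
MRBAYES_AAMODEL = {
    "JTT": "jones",
    "DAYHOFF": "dayhoff",
    "WAG": "wag",
    "MTREV": "mtrev",
    "MTMAM": "mtmam",
    "RTREV": "rtrev",
    "CPREV": "cprev",
    "VT": "vt",
    "BLOSUM62": "blosum",
}


def mrbayes_commands(model: str) -> Tuple[List[str], bool]:
    """Translate e.g. 'MtREV+I+G' into MrBayes prset/lset lines.

    Returns (commands, wants_empirical_freqs); the second value is True
    when '+F' was requested, so the caller can decide how to handle it.
    """
    base, *suffixes = model.upper().split("+")
    if base not in MRBAYES_AAMODEL:
        raise ValueError(f"unsupported model: {model}")
    flags = set(suffixes)
    unknown = flags - {"G", "I", "F"}
    if unknown:
        raise ValueError(f"unsupported suffix in: {model}")
    if {"G", "I"} <= flags:
        rates = "invgamma"
    elif "G" in flags:
        rates = "gamma"
    elif "I" in flags:
        rates = "propinv"
    else:
        rates = "equal"
    commands = [
        f"prset aamodelpr=fixed({MRBAYES_AAMODEL[base]});",
        f"lset rates={rates};",
    ]
    return commands, "F" in flags
```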

Phase 4: Selection Analysis Preparation

Alignment Mapping Function

  1. This uses the nucleotide database, which should have been created after cleaning and before translating the sequences.

  2. Terminal stop codons should be removed from the nucleotide database, because VESPA has already removed them from the protein-coding sequences. Removing them should therefore be an option or function in the alignment mapping phase; otherwise the function fails with an error.

  3. The sequence headers should match exactly and contain a transcript ID. This prevents errors caused by alternative transcripts in the original files and allows the protein-coding sequences to be matched with the correct transcript.
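Until such an option exists, the stop codons in point 2 can be trimmed from the nucleotide sequences beforehand. A minimal sketch, assuming the standard genetic code (TAA, TAG, TGA) and in-frame coding sequences; `strip_terminal_stop` is a hypothetical helper, not a VESPA function.

```python
# Stop codons of the standard genetic code (assumption: no alternative code)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def strip_terminal_stop(seq: str) -> str:
    """Remove one trailing stop codon from an in-frame nucleotide sequence.

    Sequences whose length is not a multiple of three are returned
    unchanged, since the final three bases may not be a real codon.
    """
    if len(seq) >= 3 and len(seq) % 3 == 0 and seq[-3:].upper() in STOP_CODONS:
        return seq[:-3]
    return seq
```

Applying this to every record before building the nucleotide database keeps the nucleotide and protein sequences the same length in codons.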

MrBayes Reader Function

  1. This should be underneath the MrBayes setup, not after the Gene Tree Inference and CodeML setup. The Gene Tree Inference and MrBayes steps should be presented together as alternatives.

  2. The Nexus to Newick conversion does not work. The MrBayes output is not recognised as a .nex file, whether the full directory or the consensus tree is supplied. Perhaps DendroPy may work here.
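For reference, a conversion with DendroPy might look like the sketch below. It is untested against real MrBayes output and is not part of VESPA; `nexus_to_newick` is a hypothetical wrapper, and it assumes the third-party DendroPy package (version 4 or later) is installed.

```python
def nexus_to_newick(nexus_path: str, newick_path: str) -> int:
    """Convert every tree in a Nexus file to Newick; return the tree count."""
    import dendropy  # third-party; imported here so the module loads without it

    trees = dendropy.TreeList.get(path=nexus_path, schema="nexus")
    trees.write(path=newick_path, schema="newick")
    return len(trees)
```

Pointing it at the consensus tree file written by MrBayes would be the obvious first thing to try.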