To install the packages, software and libraries required by the following code: pip install -r requirements.txt
Note: dickeya.db, dickeya_cds_aa.fasta and dickeya_cds_nt.fasta are required in order to execute the following.
The first step is to obtain the input sequence data, which derive from CAZy. A Python script named cazy_uniprot_dickeya.py (./bin) queries UniProt and returns CAZy data for a specific family, which we can predefine.
The database has been specified as CAZy and the taxonomy as Dickeya.
Executing python3 cazy_uniprot_dickeya.py (after navigating to ./bin) generates a new directory under ./data named ./data/cazy_dickeya and stores all CAZy sequences for Dickeya, separated per CAZy family and saved under the appropriate name.
The script:
[1] Retrieve all Dickeya entries with a cross-reference to CAZy, in tabular format: accession, CAZy_xrefs
[2] Create a dictionary mapping CAZy families to accessions e.g. {GH3: [P11073, D2BXL2, ...], ...}
We use a defaultdict for this. As caz is just a string, we use io.StringIO to handle it like a file whose lines we iterate over. The first line contains the column headers, so we ignore it. Each remaining line is split into an accession and a list of CAZy families; from these we construct the dictionary.
[3] For each CAZy family, retrieve the mapped UniProtKB accessions in FASTA format and write them to a file named after the family
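Step [2] can be sketched as follows (the exact column layout returned by UniProt, and the example accessions beyond those mentioned above, are assumptions):

```python
import io
from collections import defaultdict

# Hypothetical tab-separated UniProt response: accession, then
# semicolon-delimited CAZy cross-references, with a header line.
caz = "Entry\tCAZy\nP11073\tGH3;\nD2BXL2\tGH3;\nQ00001\tPL1;GH5;\n"

family_to_accessions = defaultdict(list)
handle = io.StringIO(caz)  # treat the string like a file
next(handle)               # skip the header line
for line in handle:
    accession, families = line.rstrip("\n").split("\t")
    # each cross-reference ends with ';', so drop the empty trailing entry
    for family in filter(None, families.split(";")):
        family_to_accessions[family].append(accession)

print(dict(family_to_accessions))
# → {'GH3': ['P11073', 'D2BXL2'], 'PL1': ['Q00001'], 'GH5': ['Q00001']}
```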
Run python3 process_cazy_data.py
The following steps were performed to process the data retrieved from UniProt:
However, UniProt failed to detect and download a couple of CAZy families with Dickeya entries, namely the following CAZy families: CE8, CE1, CE4, CE9, CE11, CE12, PL26. For those CAZy families, the sequences and locus tags were downloaded manually from CAZy. The locus tags for the CAZy families that were not detected via UniProt were merged into a single text file named locus_Dickeya_not_det.txt; we merge the txt files by running python3 merge_locus.py. We then obtain the RBBH for those locus tags by running the RBBH_not_det_uniprot.py Python script; the script runs from within the process_cazy_data.py script.
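A minimal sketch of what merge_locus.py is assumed to do (the file names here are illustrative, and the deduplication of repeated tags is an assumption):

```python
from pathlib import Path

def merge_locus_files(input_paths, output_path):
    """Concatenate per-family locus-tag files into a single file,
    skipping blank lines and duplicate tags."""
    seen = set()
    with open(output_path, "w") as out:
        for path in input_paths:
            for line in Path(path).read_text().splitlines():
                tag = line.strip()
                if tag and tag not in seen:
                    seen.add(tag)
                    out.write(tag + "\n")
```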
The next step was to make sure that the FASTA files containing the RBBH output do not contain any duplicate sequences. We do that by looping over all FASTA files in each CAZy directory: for the first file, we add every record.id to a list and write those sequences to a new file named after the CAZy family. We then check all other FASTA files stored within the directory for record.ids which are not already in the list. If a record.id is not in the list, we append the sequence to the new FASTA file and the id to the list; if it already exists in the list, we skip that sequence.
The new FASTA files containing unique sequences after the RBBH analysis, named after the CAZy family they belong to, were stored under the data/cazy_rbbh directory.
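The deduplication loop described above could look like this (a dependency-free sketch; the pipeline itself presumably uses Biopython's SeqIO, for which `read_fasta` is a stand-in):

```python
def read_fasta(path):
    """Minimal FASTA reader yielding (record_id, sequence) pairs."""
    record_id, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(seq)
                record_id, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
        if record_id is not None:
            yield record_id, "".join(seq)

def deduplicate(fasta_paths, out_path):
    """Write each record id only once across all input FASTA files."""
    seen = []  # record ids already written
    with open(out_path, "w") as out:
        for path in fasta_paths:
            for record_id, seq in read_fasta(path):
                if record_id not in seen:
                    seen.append(record_id)
                    out.write(f">{record_id}\n{seq}\n")
```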
As some of those families have quite a few representative enzymes, the next step was to split those families based on the RBBH table from the dickeya database. By running split_CAZy_families_1.py we can use sys.argv[] to automate the process for all large CAZy families. Finally, a Python script named automation_split_CAZy_families_1.py was created to pass all individual CAZy families as arguments. The code runs as follows:
$ python3 automation_split_CAZy_families_1.py
The script generates the following:
Some CAZy families include sequences which are very diverse in comparison to the other sequences; therefore, to obtain a more reliable output indicating positive selection, we have to exclude those sequences from the MSA. We do that for four CAZy families by running the remove_ids.py Python script.
Run rename_dies.py and rename_dirs_2.py to rename the directories and merge the data.
a. Generate Input for codeml
The next step is to obtain MSAs and phylogenetic trees using RAxML, to use as input for the positive selection analysis. We generate those data by running automation_repeat_split_families.py, which runs the ps_automation_repeat_split_families.sh shell script, and ps_automation.py, which runs the ps_automation.sh shell script.
Both shell scripts invoke several Python scripts to obtain the back-translations, the MSAs of protein and nucleotide sequences, and different formats of the MSAs (FASTA, CLUSTAL, RPHYLIP).
They also split the nucleotide sequences into partitions, generate phylogenetic trees using RAxML, split the trees into subtrees by alternately assigning a different branch node each time, modify the rphylip MSA so it is readable by RAxML, and organise each repo so that we have access to the phylogenetic tree, the control file for codeml and the MSA.
Finally, the generated input data were transferred onto the local cluster.
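One of those steps, splitting a nucleotide sequence into partitions, can be sketched as splitting by codon position (the exact partition scheme used by the shell scripts is an assumption):

```python
def codon_partitions(nt_sequence):
    """Split an in-frame nucleotide sequence into first, second and
    third codon positions, one common way to define RAxML partitions."""
    return tuple("".join(nt_sequence[i::3]) for i in range(3))

p1, p2, p3 = codon_partitions("ATGGCTAAA")  # codons: ATG GCT AAA
# p1 = "AGA", p2 = "TCA", p3 = "GTA"
```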
b. Cluster
Shell scripts were written for each CAZy family and subgroup in order to run codeml for all trees within each CAZy family/subgroup under the alternative and null models. Finally, a codeml mlc output file was generated for each tree within each CAZy family and stored under the corresponding tree directory, holding the maximum likelihood information.
Note that some of the analysis code requires ETE, which therefore needs to be installed in the directory in advance.
For those families/tree directories for which the codeml output was not complete, I ran: bash codeml.sh $CAZYFAMILY, or navigated into the directory with the missing data and simply ran codeml with the control file we are providing (depending on whether it is the null or alternative model).
The next step was to combine the two columns holding the CAZy family name and the group into one by running the combine_col.r code; the output file is named codeml_results_split_COMB.csv.
The two files (codeml_results_split_COMB.csv and codeml_results.csv) were bound into one by running bind.R code #1 and named codeml_bind_data.csv.
The next step was to split codeml_bind_data.csv into individual families in order to calculate the FDR and q-values. I achieved that by running the split.R code. The csv files were stored under the split_fams working directory (current location: the PS directory).
The FDR values were then calculated by running the fdr.R code, which invokes fdr_calc from within the fdr_function.R script. The csv files including the FDR values were stored under the fdr directory (current directory: split_fams).
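fdr_function.R is not shown here, but assuming fdr_calc wraps R's p.adjust(p, method = "BH"), the equivalent Benjamini-Hochberg computation in Python is:

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, matching R's
    p.adjust(p, method = "BH")."""
    n = len(p_values)
    # walk the p-values from largest to smallest, enforcing monotonicity
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # ascending rank of p_values[i]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```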
The csvs were combined into one csv named codeml_bind_data_fdr.csv using bind.R code #2 (current working directory: ./PS/fdr). We move codeml_bind_data_fdr.csv to the PS directory.
Then, the q-values were obtained for each family by running the qvals.R script. The script invokes the qval_calc function from within the qvalue_function.R script. The input files are the csvs from the fdr directory.
The csv files with the FDR and q-values were combined into one by running #3 from the bind.R code, and the file was named codeml_bind_data_qvals.csv (current directory: qvalue). Move the data into the PS directory.
Finally, codeml_bind_data_qvals.csv can be filtered on either the q-values or the FDR values. The fdr.R and qvalue.R scripts include code for filtering the data.
Before processing codeml_bind_data_qvals.csv, we need to remove those families/trees/branch sites which are marked externally. We do that through the ranking.py script. a. To get the #1-labelled branches we use ETE. The Python script leaves.py does three things: removes any externally labelled nodes, gets the labelled nodes, and creates a dictionary of CAZy_family : leaves in order to generate heatmaps etc. of the positively selected CAZy families and the nodes at which they appear to be under positive selection.
First, I remove all those entries which are labelled externally using the external_nodes function from leaves.py, running the first 5 cells from within ranking.ipynb.
The code writes those families/trees which are labelled internally out to a csv file named no_external_nodes_ps.csv.
The next step is to filter those families using the filter_data.R code, filtering on filter(log(qvalue) < -20).
The filtered_cazy_fams csv file was obtained by running the filter_data.R script on no_external_nodes_ps.csv. The csv contains all the CAZy families which are above the threshold I have preselected and also records how many times each CAZy family has been observed.
A csv file named final_ps.csv was generated by running filter_data.R, including all families with q-values etc. at the preselected threshold.
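The filter(log(qvalue) < -20) step translates to Python as follows (the column names and example rows are hypothetical; note that R's log(), like math.log, is the natural logarithm):

```python
import math

# Hypothetical rows of the q-value table (column names are assumptions).
rows = [
    {"family": "GH3", "group": "g1", "qvalue": 1e-12},
    {"family": "PL1", "group": "g2", "qvalue": 1e-3},
]

THRESHOLD = -20  # keep rows with log(qvalue) < -20, as in filter_data.R
kept = [r for r in rows
        if r["qvalue"] > 0 and math.log(r["qvalue"]) < THRESHOLD]
print([r["family"] for r in kept])  # → ['GH3']
```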
We make the final csv a data frame (ranking.ipynb) and create a dictionary using the cazy_leaves function (from the leaves.py script) to get the CAZy families and the labelled leaves.
We arrange the leaves into groups so we can access them, and finally write out a csv file named positive_selected_qvals.csv, which shows which CAZy families and groups are above the preselected threshold. Finally, a heat map was generated using seaborn to represent those positively selected families; the png was named heatmap_full_analysis.png.