Authors: Arnaud Felten, Déborah Merda
Affiliation: Food Safety Laboratory – ANSES Maisons Alfort (France)
You can find the latest version of the tool at https://github.com/afelten-Anses/NAuRA
HTML and PDF technical documentation are available in the 'docs/' directory.
This workflow, called NAuRA for "Nice automated research of alleles", aims to detect genes or proteins using a BLAST-based approach. If an alternative version of a gene or protein is detected, NAuRA extracts the corresponding sequence and adds this new version to the list of queries. Finally, NAuRA builds a matrix specifying which query and which allele were found in each genome.
NAuRA can also perform a phylogenetic analysis using a neighbor-joining approach based on the allele sequences.
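The allele detection idea can be illustrated with a minimal sketch. This is not NAuRA's actual code: the function name `classify_hit`, the three labels, the threshold values, and the rule that a hit with 100% identity and 100% coverage corresponds to an already known allele are all assumptions made for illustration.

```python
def classify_hit(identity, coverage, min_identity, min_coverage):
    """Classify the best BLAST hit of one query in one genome.

    identity and coverage are percentages between 0 and 100.
    The decision rules below are assumptions, not NAuRA's documented logic.
    """
    if identity < min_identity or coverage < min_coverage:
        return "absent"   # hit too weak: the query is considered not found
    if identity == 100.0 and coverage == 100.0:
        return "known"    # exact match to a sequence already in the queries
    return "new"          # passes the thresholds but differs: candidate new allele


# Hypothetical (identity, coverage) values for three genomes.
hits = {
    "genome1": (100.0, 100.0),
    "genome2": (97.5, 100.0),
    "genome3": (60.0, 45.0),
}
calls = {genome: classify_hit(ident, cov, 90.0, 90.0)
         for genome, (ident, cov) in hits.items()}
print(calls)
```

In this sketch, a "new" call is the case in which the hit sequence would be extracted and appended to the queries as an additional allele.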
The different steps of the workflow are presented below:
conda config --add channels afelten
conda install naura
If necessary, make NAuRA executable:
chmod +x NAuRA
Add the scripts to the PATH in your .bashrc or .bash_profile:
export PATH=$PATH:NAuRA/
You can then run it as a shell command:
NAuRA
NAuRA was developed with Python 2.7 (tested with 2.7.12).
The parameters of each script are displayed by any of the following three commands:
NAuRA
NAuRA -h
NAuRA --help
By default, NAuRA expects protein sequences as queries. If the queries are nucleotide sequences, the '--nucl' option must be set. NAuRA cannot work with protein and nucleotide sequences at the same time.
NAuRA needs a queries file that lists the paths of all query FASTA files. Each query must be in a separate FASTA file, and the query header must end with "_1" (see the example below). This allows NAuRA to detect the initial query and to increment the allele number.
>queryA_1
SIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPST
ETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGL
GKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLI
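The header convention above can be checked before running the tool. The following is a minimal sketch, not part of NAuRA: the helper names are hypothetical, and it assumes that new alleles are named by incrementing the numeric suffix (queryA_2, queryA_3, ...) and that the header contains no description after the name.

```python
import re

# '>name_number', where number is the allele number (assumed naming scheme).
HEADER_PATTERN = re.compile(r"^>(?P<name>.+)_(?P<number>\d+)$")


def parse_header(header_line):
    """Split a FASTA header such as '>queryA_1' into ('queryA', 1)."""
    match = HEADER_PATTERN.match(header_line.strip())
    if match is None:
        raise ValueError("header must end with '_<number>': %r" % header_line)
    return match.group("name"), int(match.group("number"))


def is_initial_query(header_line):
    """Return True if the header carries the '_1' suffix of an initial query."""
    return parse_header(header_line)[1] == 1


def next_allele_header(existing_headers):
    """Build the header of the next allele from the headers already stored."""
    parsed = [parse_header(header) for header in existing_headers]
    name = parsed[0][0]
    highest = max(number for _, number in parsed)
    return ">%s_%d" % (name, highest + 1)


print(parse_header(">queryA_1"))
print(next_allele_header([">queryA_1", ">queryA_2"]))
```

A header that does not match the pattern raises a `ValueError`, which makes a misnamed query visible before any BLAST search is started.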
Optionally, a specific minimum coverage (column 1) and/or minimum identity (column 2) can be set for each query. If no value is specified, the values given by the '-pl' and/or '-ph' arguments are used by default. Values must be separated by a tab character.
/data/myProject/queryA.fasta
/data/myProject/queryB.fasta 90 90
/data/myProject/queryC.fasta 95
/data/myProject/queryD.fasta 70
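To make the queries file layout concrete, here is a minimal parsing sketch. It is not NAuRA's own parser: the function names and the default thresholds of 80.0 are placeholders, and it assumes that an empty field between two tabs means "use the default" for that column.

```python
def parse_queries_line(line, default_coverage, default_identity):
    """Parse one tab-separated line: path, optional coverage, optional identity."""
    fields = line.rstrip("\n").split("\t")
    path = fields[0]
    coverage = default_coverage
    identity = default_identity
    # First value after the path: minimum coverage, if present and not empty.
    if len(fields) > 1 and fields[1].strip():
        coverage = float(fields[1])
    # Second value after the path: minimum identity, if present and not empty.
    if len(fields) > 2 and fields[2].strip():
        identity = float(fields[2])
    return path, coverage, identity


def parse_queries_file(lines, default_coverage, default_identity):
    """Parse all non-blank lines of a queries file."""
    return [parse_queries_line(line, default_coverage, default_identity)
            for line in lines if line.strip()]


# Example lines; the last one leaves the coverage field empty (assumed syntax).
example = [
    "/data/myProject/queryA.fasta\n",
    "/data/myProject/queryB.fasta\t90\t90\n",
    "/data/myProject/queryC.fasta\t95\n",
    "/data/myProject/queryD.fasta\t\t70\n",
]
for entry in parse_queries_file(example, 80.0, 80.0):
    print(entry)
```

The two defaults passed to `parse_queries_file` stand in for the values that would otherwise come from the command-line arguments.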
NAuRA produces a matrix file in TSV (tab-separated values) format. For each analyzed genome and each query, a value is assigned:
NAuRA stores new alleles in each query FASTA file listed in the queries file. NAuRA can therefore be rerun on a new dataset of genomes with all of these alleles.
The filtered BLAST output files can be kept with the '--keepBlastAln' option (one BLAST file per genome).
If the '--withPhylo' option is given, NAuRA produces additional output files:
You can test NAuRA with the following commands:
cd test
NAuRA -i genomes -q list_queries.txt -T 1 --withPhylo --keepBlastAln