NAuRA README

Authors: Arnaud Felten, Déborah Merda

Affiliation: Food Safety Laboratory – ANSES Maisons Alfort (France)

You can find the latest version of the tool at https://github.com/afelten-Anses/NAuRA

HTML and PDF technical documentation are available in the 'docs/' directory.

NAuRA workflow

This workflow, called NAuRA for "Nice automated research of alleles", aims to detect genes or proteins with a blast-based approach. If an alternative version of a gene/protein is detected, NAuRA extracts the corresponding sequence and adds this new version to the list of queries. Finally, NAuRA builds a matrix that specifies which query and which allele were found for each genome.

NAuRA also makes it possible to perform a phylogenetic analysis, using a neighbor-joining approach based on allele sequences.

The different steps of the workflow are presented below:

Quick Start

Usage (Linux/Mac OS X)

Install with conda

conda config --add channels afelten
conda install naura

Usage (Linux/Mac OS X)

If necessary, make NAuRA executable:

chmod +x NAuRA

Add the scripts to your PATH in your bashrc or bash_profile:

export PATH=$PATH:NAuRA/

Then you can run it as a shell command:

NAuRA

Dependencies

NAuRA has been developed with Python 2.7 (tested with 2.7.12).

External dependencies

Biopython - tested with 1.70
blast+ - tested with 2.2.31+
Fastx-toolkit - tested with 0.0.14
pairdist - tested with 1.0
clustalo - tested with 2.1
Dendropy (sumtrees.py) - tested with 4.3.0

Parameters

The parameters of each script are displayed with any of the following 3 commands:

NAuRA
NAuRA -h
NAuRA --help

NAuRA parameters

-i : Folder with GBK files; these files must be named in the format 'GENOMEid.gbk' (REQUIRED)
-m : Matrix file if exists (default:matrix.tsv)
-q : Text file with the paths of the query Fasta files, one per line (REQUIRED). Optionally, set the coverage and identity percentages for a query, separated by tabs. The header of each query must end with '_1', with no additional '_' character. Each query file must have the same name as its allele.
-l : Text file listing the already analyzed genomes (default:list.txt)
-pl : Minimum percent of alignment coverage (default:80)
-ph : Minimum percent of alignment identity (default:80)
-T : Number of threads to use (default:2)
-b : Number of bootstrap replicates, only with the --withPhylo option (default:1)
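
As an illustration of the 'GENOMEid.gbk' naming expected by '-i' and of the list of already analyzed genomes used by '-l', here is a minimal Python 3 sketch. It is not taken from the NAuRA source: the function names genome_ids and pending_genomes are hypothetical, and the sketch simply assumes that the genome ID is the file name without its '.gbk' extension.

```python
import os
import tempfile


def genome_ids(folder):
    """Return the sorted genome IDs of the '.gbk' files found in a folder.

    Assumption: the genome ID is the file name without the '.gbk' extension.
    """
    ids = []
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)
        if ext == ".gbk" and base:
            ids.append(base)
    return ids


def pending_genomes(folder, analyzed_ids):
    """Return the genome IDs of the folder that are not in analyzed_ids."""
    done = set(analyzed_ids)
    return [gid for gid in genome_ids(folder) if gid not in done]


# Small demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ("strainA.gbk", "strainB.gbk", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    print(pending_genomes(folder, ["strainA"]))
```

Files with another extension are ignored, so a stray text file in the genomes folder does not produce a genome ID in this sketch.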

NAuRA options

--nucl : Specify queries are nucleic sequences
--withPhylo : Do the phylogeny analysis of new alleles
--keepBlastAln : Keep blast results for each genome
--noDrift : Similarity and coverage are always tested against the default allele (takes longer, recommended if query sequences are close)

Important

By default, NAuRA needs protein sequences as queries. If the queries are nucleic sequences, the '--nucl' option must be set. NAuRA cannot work with protein and nucleic sequences at the same time.

Queries file

NAuRA needs a queries file that lists the paths of all query fasta files. Each query must be in a separate fasta file, and the query header must end with "_1" (see the example below). This allows NAuRA to detect the initial query and to increment the allele number.

>queryA_1
SIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPST
ETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGL
GKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLI

Optionally, a specific minimum coverage (column 1) and/or minimum identity (column 2) can be set for each query. If no value is specified, the values given by the '-pl' and/or '-ph' arguments are used. Values must be separated by a tab character.

/data/myProject/queryA.fasta
/data/myProject/queryB.fasta    90  90
/data/myProject/queryC.fasta    95  
/data/myProject/queryD.fasta        70
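
The queries file layout shown above can be read with a few lines of Python. The following is only a sketch under stated assumptions, not the parser used by NAuRA: parse_queries_line and parse_queries_file are hypothetical names, and the defaults of 80 mirror the documented defaults of '-pl' and '-ph'.

```python
def parse_queries_line(line, default_cov=80.0, default_id=80.0):
    """Parse one line: path [TAB min coverage [TAB min identity]]."""
    fields = line.rstrip("\n").split("\t")
    path = fields[0].strip()
    # An empty or missing column falls back to the default value.
    cov = fields[1].strip() if len(fields) > 1 else ""
    ident = fields[2].strip() if len(fields) > 2 else ""
    return (path,
            float(cov) if cov else default_cov,
            float(ident) if ident else default_id)


def parse_queries_file(text, default_cov=80.0, default_id=80.0):
    """Parse the whole content of a queries file, skipping blank lines."""
    return [parse_queries_line(line, default_cov, default_id)
            for line in text.splitlines() if line.strip()]


# Same four cases as the example above, with explicit tab characters.
example = (
    "/data/myProject/queryA.fasta\n"
    "/data/myProject/queryB.fasta\t90\t90\n"
    "/data/myProject/queryC.fasta\t95\t\n"
    "/data/myProject/queryD.fasta\t\t70\n"
)
for path, cov, ident in parse_queries_file(example):
    print(path, cov, ident)
```

Note that queryD needs two consecutive tab characters so that 70 lands in the identity column rather than in the coverage column.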

Outputs

NAuRA produces a matrix file in TSV (tab-separated values) format. For each analyzed genome and each query, a value is given:

'0' if the query isn't found for this genome;
'1' if the query found is the reference allele;
the allele number otherwise.
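
The meaning of the matrix values can be sketched in Python as follows. The exact layout of the matrix file is not described here, so this example assumes a header row naming the queries and one row per genome with the genome ID in the first column; describe_cell and summarize_matrix are hypothetical names, and the sample data is made up.

```python
import csv
import io


def describe_cell(value):
    """Translate a matrix value into a human-readable label."""
    number = int(value)
    if number == 0:
        return "absent"
    if number == 1:
        return "reference allele"
    return "allele %d" % number


def summarize_matrix(handle):
    """Map each genome ID to a {query: label} dictionary.

    Assumption: first row = query names, first column = genome IDs.
    """
    reader = csv.reader(handle, delimiter="\t")
    queries = next(reader)[1:]
    summary = {}
    for row in reader:
        if not row:
            continue
        summary[row[0]] = {q: describe_cell(v)
                           for q, v in zip(queries, row[1:])}
    return summary


# Made-up matrix content, for illustration only.
example = "genome\tqueryA\tqueryB\ngenome1\t1\t0\ngenome2\t3\t1\n"
summary = summarize_matrix(io.StringIO(example))
for genome in sorted(summary):
    print(genome, summary[genome])
```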

NAuRA stores new alleles in each query fasta file listed in the queries file. It is then possible to rerun NAuRA on a new dataset of genomes with all of these alleles.

The filtered blast output files can be kept with the '--keepBlastAln' option (one blast file per genome).

If the '--withPhylo' option is given, NAuRA produces additional output files:

'queries_alignment.fasta' is a fasta file where all queries are aligned by clustalo;
'queries_alignment.tree' is a newick file with all bootstrap trees obtained by pairdist;
'queries_alignment.tree' is the consensus tree in newick format obtained by sumtrees.py.

Test

You can test NAuRA with the following commands:

cd test
NAuRA -i genomes -q list_queries.txt -T 1 --withPhylo --keepBlastAln
