Nice Automatic Research of Alleles
GNU General Public License v3.0

NAuRA README

Authors: Arnaud Felten, Déborah Merda

Affiliation: Food Safety Laboratory – ANSES Maisons Alfort (France)

You can find the latest version of the tool at https://github.com/afelten-Anses/NAuRA

HTML and PDF technical documentation are available in the 'docs/' directory.

NAuRA workflow

This workflow, called NAuRA for "Nice automated research of alleles", aims to detect genes or proteins using a BLAST-based approach. If an alternative version of a gene/protein is detected, NAuRA extracts the corresponding sequence and adds this new version to the list of queries. Finally, NAuRA builds a matrix specifying which query and which allele were found in each genome.

NAuRA also makes it possible to perform a phylogenetic analysis using a neighbor-joining approach based on allele sequences.

The different steps of the workflow are presented below:

Quick Start

Install with conda

conda config --add channels afelten
conda install naura

Usage (Linux/Mac OS X)

If necessary, make NAuRA executable:

chmod +x NAuRA

Add the script directory to your PATH in your .bashrc or .bash_profile:

export PATH=$PATH:NAuRA/

Then you can run it as a shell command:

NAuRA

Dependencies

NAuRA was developed with Python 2.7 (tested with 2.7.12).

External dependencies

Parameters

The parameters of each script are displayed with any of the following three invocations:

NAuRA
NAuRA -h
NAuRA --help

NAuRA parameters

NAuRA options

Important

By default, NAuRA needs protein sequences as queries. If the queries are nucleic sequences, the '--nucl' option must be set. NAuRA cannot work with both protein and nucleic sequences simultaneously.

Queries file

NAuRA needs a queries file that lists the paths of all query FASTA files. Each query must be in a separate FASTA file, and the query header must end with "_1" (see the example below). This allows NAuRA to detect the initial query and to increment the allele number.

>queryA_1
SIPFNANWLNQQYAEIIQAILFDVVGYEVKPHFITTEELANYSNNETATPKETTKPST
ETTEDNHVLGREQFNAHNTFDTFVIGPGNRFPHAASLAVAEAPAKAYNPLFIYGGVGL
GKTHLMHAIGHHVLDNNPDAKVIYTSSEKFTNEFIKSIRDNEGEAFRERYRNIDVLLI
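
To illustrate the "_1" naming convention, here is a minimal Python sketch of how the next allele header could be derived from the headers already present in a query FASTA file. It is not NAuRA's actual code: the function name next_allele_header and the assumption that every allele header has the form "<name>_<number>" are hypothetical and only used for illustration.

```python
import re

# Assumption: every allele header has the form "<name>_<number>",
# as in ">queryA_1". This pattern is illustrative, not taken from NAuRA.
_HEADER_RE = re.compile(r"^>?(?P<name>.+)_(?P<num>\d+)$")


def next_allele_header(headers):
    """Return the header for a new allele of one query.

    `headers` is the list of FASTA headers already stored for that query.
    """
    name = None
    highest = 0
    for header in headers:
        match = _HEADER_RE.match(header.strip())
        if match is None:
            raise ValueError("header does not end with _<number>: %r" % header)
        if name is None:
            name = match.group("name")
        elif match.group("name") != name:
            raise ValueError("headers belong to different queries: %r" % header)
        highest = max(highest, int(match.group("num")))
    if name is None:
        raise ValueError("no headers given")
    # The new allele gets the next free number after the highest one seen.
    return ">%s_%d" % (name, highest + 1)
```

With only the initial query present, next_allele_header([">queryA_1"]) gives ">queryA_2".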

Optionally, a specific minimum coverage (first value after the path) and/or minimum identity (second value after the path) can be set for each query. If no value is specified, the values given by the '-pl' and/or '-ph' arguments are used by default. Values must be separated by a tab character.

/data/myProject/queryA.fasta
/data/myProject/queryB.fasta    90  90
/data/myProject/queryC.fasta    95  
/data/myProject/queryD.fasta        70
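
The layout above can be read as sketched below. This is a minimal, hypothetical parser written for illustration and not NAuRA's own implementation: the names QueryEntry, parse_queries_line and parse_queries_file are assumptions, and the default thresholds must be supplied by the caller (NAuRA's real defaults are not stated here).

```python
from collections import namedtuple

# Hypothetical record type: one line of the queries file.
QueryEntry = namedtuple("QueryEntry", ["path", "min_coverage", "min_identity"])


def parse_queries_line(line, default_coverage, default_identity):
    """Parse one tab-separated line: path, optional coverage, optional identity."""
    fields = line.rstrip("\r\n").split("\t")
    path = fields[0].strip()
    if not path:
        raise ValueError("missing FASTA path in line: %r" % line)

    def value_or_default(index, default):
        # An absent or empty column falls back to the caller's default.
        if index < len(fields) and fields[index].strip():
            return float(fields[index])
        return default

    return QueryEntry(
        path,
        value_or_default(1, default_coverage),
        value_or_default(2, default_identity),
    )


def parse_queries_file(lines, default_coverage, default_identity):
    """Parse all non-blank lines of a queries file (any iterable of strings)."""
    return [
        parse_queries_line(line, default_coverage, default_identity)
        for line in lines
        if line.strip()
    ]
```

For example, the line for queryD.fasta, with an empty first value and 70 as the second, keeps the caller's default coverage and uses 70 as the minimum identity.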

Outputs

NAuRA produces a matrix file in TSV (tab-separated values) format. For each analyzed genome and each query, a value is assigned:

NAuRA stores new alleles in each query FASTA file listed in the queries file. NAuRA can then be rerun on a new dataset of genomes with all of these alleles.

The filtered BLAST output files can be kept with the '--keepBlastAln' option (one BLAST file per genome).

If the '--withPhylo' option is given, NAuRA produces additional output files:

Test

You can test NAuRA with the following commands:

cd test
NAuRA -i genomes -q list_queries.txt -T 1 --withPhylo --keepBlastAln