Eco-Flow / synteny

A Nextflow pipeline for running synteny analysis.

nf-synteny

A simple pipeline to run a macro synteny analysis.

The pipeline is under development, so if you wish to use it for your own research, please contact us (ecoflow.ucl [at] gmail.com). We can give you up-to-date details of the methods and up-to-date figures. We will release a final version when the pipeline is published.

Synteny is the study of chromosome arrangement and gene order. Over evolutionary time, two species diverge from the state of the common ancestor, due to a variety of structural changes. These include indels, inversions, translocations, fusions and fissions. This pipeline aims to produce common synteny plots, as well as tables documenting the types of syntenic changes.

The pipeline takes a csv (comma-separated values) file as input, listing each species you wish to compare followed by its RefSeq ID. Genomes must be chromosome-level assemblies, with a maximum of 50 chromosomes/scaffolds.
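If you are unsure whether a local genome is within the 50 chromosome/scaffold limit, a quick check before running the pipeline can help. The sketch below is not part of the pipeline; it assumes that each chromosome or scaffold is one FASTA record, and the function names `count_fasta_records` and `check_chromosome_limit` are hypothetical helpers introduced here for illustration.

```shell
# Count sequence records (header lines starting with ">") in a FASTA file.
# Handles both plain and gzip-compressed (.gz) files.
count_fasta_records() {
  case "$1" in
    *.gz) gzip -dc "$1" | grep -c '^>' || true ;;
    *)    grep -c '^>' "$1" || true ;;
  esac
}

# Report the record count and fail if it exceeds 50.
# Assumption: one FASTA record per chromosome/scaffold.
check_chromosome_limit() {
  n_records=$(count_fasta_records "$1")
  if [ "$n_records" -gt 50 ]; then
    echo "$1: $n_records sequences (more than 50)" >&2
    return 1
  fi
  echo "$1: $n_records sequences"
}
```

For example, `check_chromosome_limit data/Drosophila_yakuba/genome.fna.gz` prints the number of sequences found and returns a non-zero status if there are more than 50.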

The main pipeline logic is as follows:

[Figure: pipeline workflow diagram]

Tutorial

We have a short tutorial to help you test and explore the pipeline.

Installation

Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core website.

Prerequisites

Install

To install the pipeline, use one of the following commands, replacing VERSION with a release.

wget https://github.com/Eco-Flow/synteny/archive/refs/tags/VERSION.tar.gz -O - | tar -xzvf -

or

curl -L https://github.com/Eco-Flow/synteny/archive/refs/tags/VERSION.tar.gz --output - | tar -xzvf -

This will produce a directory in the current directory called synteny-VERSION which contains the pipeline.

Inputs

Required

The input csv can take two forms:

Please note: the genome has to be chromosome level, not contig level.

2 fields (Name,Refseq_ID):

Drosophila_yakuba,GCF_016746365.2
Drosophila_simulans,GCF_016746395.2
Drosophila_santomea,GCF_016746245.2

3 fields (Name,genome.fna,annotation.gff):

Drosophila_yakuba,data/Drosophila_yakuba/genome.fna.gz,data/Drosophila_yakuba/genomic.gff.gz
Drosophila_simulans,data/Drosophila_simulans/genome.fna.gz,data/Drosophila_simulans/genomic.gff.gz
Drosophila_santomea,data/Drosophila_santomea/genome.fna.gz,data/Drosophila_santomea/genomic.gff.gz
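Before launching a run, it can be useful to check that the csv matches one of the two forms above. The following is a minimal sketch, not part of the pipeline: `validate_samplesheet` is a hypothetical helper, and the accession pattern (`GCF_` followed by digits, a dot and a version number) is an assumption based on the example IDs shown here.

```shell
# Check each non-empty line of the csv:
#   2 fields: the second field should look like a RefSeq accession
#   3 fields: the genome and annotation files should exist
# Problems are printed to stderr; the return status is 1 if any were found.
validate_samplesheet() {
  status=0
  lineno=0
  while IFS= read -r line || [ -n "$line" ]; do
    lineno=$((lineno + 1))
    [ -z "$line" ] && continue
    nfields=$(printf '%s\n' "$line" | awk -F',' '{ print NF }')
    case "$nfields" in
      2)
        acc=${line#*,}
        # Assumed accession pattern, e.g. GCF_016746365.2
        if ! printf '%s\n' "$acc" | grep -Eq '^GCF_[0-9]+\.[0-9]+$'; then
          echo "line $lineno: '$acc' does not look like a RefSeq ID" >&2
          status=1
        fi
        ;;
      3)
        rest=${line#*,}
        genome=${rest%%,*}
        annotation=${rest#*,}
        for f in "$genome" "$annotation"; do
          if [ ! -f "$f" ]; then
            echo "line $lineno: file not found: $f" >&2
            status=1
          fi
        done
        ;;
      *)
        echo "line $lineno: expected 2 or 3 fields, found $nfields" >&2
        status=1
        ;;
    esac
  done < "$1"
  return "$status"
}
```

Calling `validate_samplesheet my_input.csv` (where `my_input.csv` is a placeholder for your own file) returns 0 when every line passes these checks. The pipeline itself may apply different or stricter rules.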

Optional

(Default: --no_strip_names).

Profiles

This pipeline is designed to run in various modes, which can be supplied as a comma-separated list, e.g. -profile profile1,profile2.

Container Profiles

Please select one of the following profiles when running the pipeline.

Optional Profiles

Custom Configuration

If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.

Running the Pipeline

Please note: the -resume flag reuses cached results from previously successful runs of the pipeline.

Results

Once the run has completed, your output directory will be called Results, unless you specified another name.

Subdirectories:

Figures

  1. Karyotype_plots - (.karyotype.pdf). Karyotype plots of each pairwise comparison, showing a 1 to 1 chromosome mapping with lines drawn between syntenic chromosomes.
  2. Dotplot - (.pdf). Shows the chromosome synteny as a dot plot.
  3. Depth_plot - (.depth.pdf). Shows the percentage of the genome that corresponds to non-orthologous (0), 1to1 or 1toMany orthologs detected.
  4. Painted_chromosomes - (.chromo.pdf). Shows, on graphic chromosomes, which sections are syntenic between two species, in colour.

Data

  1. Gffread - Species gene fasta files (.nucl.fa), plus reformatted gff files (.gff_for_jvci.gff3).
  2. Anchors - (.anchors). Anchor files documenting the MCScanX genes in syntenic blocks, produced using the lifted function from JCVI.
  3. Last - Filtered LAST results for each pairwise run, filtered using default settings from JCVI.

Tables

  1. Trans_Inversion_junction_merged.txt - A summary of the types of syntenic break between sets of anchors.
  2. Paired_anchor_change_junction_prediction - A folder with each pairwise analysis of junction changes between syntenic blocks.
  3. My_scores.tsv - A table (pairwise) of the number of syntenic gene pairs, as well as the maximum and average syntenic block length (in numbers of genes).
  4. Synteny_matrix.tsv - A matrix of syntenic gene pair totals (pairwise).
  5. Trans_location_version.out.txt - A table of scores (pairwise), documenting the numbers of scaffolds, syntenic blocks and genes, as well as a variety of scores.
  6. Synt_gene_scores - A folder with pairwise gene scores. Scores are based on the distance to the nearest syntenic break, where '1' means a gene is on the edge of a syntenic block.
  7. My_sim_cores.tsv - A matrix containing nucleotide percentage similarities.
  8. My_comp_synteny_similarity.tsv - A matrix containing pairwise nucleotide percentages and the total number of syntenic genes.

All of the pipeline run information can be found inside pipeline_info.

Citation

This pipeline is not yet published. If you use this pipeline for your research please cite the main tool set we use (JCVI):

"Tang et al. (2008) Synteny and Collinearity in Plant Genomes. Science".

Versions will change over time, so make sure you record the exact release of the pipeline that you ran.

Contact Us

If you need any support do not hesitate to contact us at any of:

ecoflow.ucl [at] gmail.com

c.wyatt [at] ucl.ac.uk