alekseyzimin / masurca

GNU General Public License v3.0
245 stars 35 forks source link

duplicated regions in the assembly #210

Open olivierarmant opened 3 years ago

olivierarmant commented 3 years ago

Dear Alekseyzimin, Thanx a lot for producing masurca and for your continuous efforts for the comunity. :) I used masurca for a vertebrate genome using PE (250 bp and 100bp), MP and few nanopore sequences. The assembly went fine. I have a question regarding the presence of potential replicated regions in the assembly. Indeed by checking the scaffold assembled for the MT chromosom I did find that a large region (> 5kb) is duplicated (including the NADH gene). It is unlikely that this is real and I suppose this is an artefact. I would like to know if it is possible to tune masurca in order to be more stringent and avoid such replicated regions? Or could you suggest Tools to polish the assembly and remove such artefacts. Thanx in advance for your help! Below is the config file I used

DATA

Illumina paired end reads supplied as

if single-end, do not specify

MUST HAVE Illumina paired end reads to use MaSuRCA

PE= p1 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz PE= p2 550 250 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz PE= p3 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_350PE.R2.fastq.gz PE= p4 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubIISample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubIISample_Y2_350PE.R2.fastq.gz PE= p5 550 250 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_550PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_550PE.R2.fastq.gz

Illumina mate pair reads supplied as

JUMP= m1 4000 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz

JUMP= m2 3000 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz

JUMP= m3 1200 700 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_gelfree_250bp.2.fq.gz JUMP= m4 3000 700 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz JUMP= m5 8000 3000 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz

JUMP= m6 2500 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz

pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped

if you have both types of reads supply them both as NANOPORE type

PACBIO=/FULL_PATH/pacbio.fa

NANOPORE=/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Nanopore.fastq.gz

Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many

OTHER=/FULL_PATH/file.frg

synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data

REFERENCE=/FULL_PATH/nanopore.fa

END

PARAMETERS

PLEASE READ all comments to essential parameters below, and set the parameters according to your project

set this to 1 if your Illumina jumping library reads are shorter than 100bp

EXTEND_JUMP_READS=0

this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content

GRAPH_KMER_SIZE = auto

set this to 1 for all Illumina-only assemblies

set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)

USE_LINKING_MATES = 1

specifies whether to run the assembly on the grid

USE_GRID=0

specifies grid engine to use SGE or SLURM

GRID_ENGINE=SGE

specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY

GRID_QUEUE=all.q

batch size in the amount of long read sequence for each batch on the grid

GRID_BATCH_SIZE=500000000

use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads

can increase this to 30 or 35 if your reads are short (N50<7000bp)

LHE_COVERAGE=35

set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1

MEGA_READS_ONE_PASS=0

this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms

LIMIT_JUMP_COVERAGE = 300

these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.

CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.

CA_PARAMETERS = cgwErrorRate=0.15

CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data

CLOSE_GAPS=1

number of cpus to use, set this to the number of CPUs/threads per node you will be using

NUM_THREADS = 96

this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20

JF_SIZE = 60000000000

ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.

Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data

SOAP_ASSEMBLY=0

If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).

Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.

A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.

Works well even when MEGA_READS_ONE_PASS is set to 1.

DO NOT use if you have less than 15x coverage by long reads.

FLYE_ASSEMBLY=0 END

olivierarmant commented 3 years ago

Or do you think it is specific to the MT genome as it is circular? Wonder if you encoutered such observations in your precedent assemblies?

alekseyzimin commented 3 years ago

Yes, it is specific to Mt genome. On circular chromosomes/plasmids with long reads assembly sometimes duplicates regions that would close the circle when the genome is circularized.