Open olivierarmant opened 3 years ago
Or do you think it is specific to the MT genome as it is circular? Wonder if you encoutered such observations in your precedent assemblies?
Yes, it is specific to Mt genome. On circular chromosomes/plasmids with long reads assembly sometimes duplicates regions that would close the circle when the genome is circularized.
Dear Alekseyzimin, Thanx a lot for producing masurca and for your continuous efforts for the comunity. :) I used masurca for a vertebrate genome using PE (250 bp and 100bp), MP and few nanopore sequences. The assembly went fine. I have a question regarding the presence of potential replicated regions in the assembly. Indeed by checking the scaffold assembled for the MT chromosom I did find that a large region (> 5kb) is duplicated (including the NADH gene). It is unlikely that this is real and I suppose this is an artefact. I would like to know if it is possible to tune masurca in order to be more stringent and avoid such replicated regions? Or could you suggest Tools to polish the assembly and remove such artefacts. Thanx in advance for your help! Below is the config file I used
DATA
Illumina paired end reads supplied as
if single-end, do not specify
MUST HAVE Illumina paired end reads to use MaSuRCA
PE= p1 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz PE= p2 550 250 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz PE= p3 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_350PE.R2.fastq.gz PE= p4 350 150 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubIISample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubIISample_Y2_350PE.R2.fastq.gz PE= p5 550 250 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_550PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_550PE.R2.fastq.gz
Illumina mate pair reads supplied as
JUMP= m1 4000 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz
JUMP= m2 3000 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz
JUMP= m3 1200 700 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/SubISample_Y2_gelfree_250bp.2.fq.gz JUMP= m4 3000 700 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz JUMP= m5 8000 3000 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz
JUMP= m6 2500 500 /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz
pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
if you have both types of reads supply them both as NANOPORE type
PACBIO=/FULL_PATH/pacbio.fa
NANOPORE=/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Nanopore.fastq.gz
Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
OTHER=/FULL_PATH/file.frg
synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
REFERENCE=/FULL_PATH/nanopore.fa
END
PARAMETERS
PLEASE READ all comments to essential parameters below, and set the parameters according to your project
set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
set this to 1 for all Illumina-only assemblies
set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 1
specifies whether to run the assembly on the grid
USE_GRID=0
specifies grid engine to use SGE or SLURM
GRID_ENGINE=SGE
specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
GRID_QUEUE=all.q
batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=500000000
use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
can increase this to 30 or 35 if your reads are short (N50<7000bp)
LHE_COVERAGE=35
set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1
MEGA_READS_ONE_PASS=0
this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data
CLOSE_GAPS=1
number of cpus to use, set this to the number of CPUs/threads per node you will be using
NUM_THREADS = 96
this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20
JF_SIZE = 60000000000
ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.
Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).
Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.
A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.
Works well even when MEGA_READS_ONE_PASS is set to 1.
DO NOT use if you have less than 15x coverage by long reads.
FLYE_ASSEMBLY=0 END