alekseyzimin / masurca

GNU General Public License v3.0
Error correction of PE reads failed. Check pe.cor.log. #70

Open jterol opened 5 years ago

jterol commented 5 years ago

Hi!

I'm trying to run MaSuRCA with Illumina paired-end libraries and PacBio long reads.

Here is my config file:

DATA

Illumina paired end reads supplied as

if single-end, do not specify

MUST HAVE Illumina paired end reads to use MaSuRCA

PE= pe 515 13 /home/jterol/PacBio/ivia000_1.fastq /home/jterol/PacBio/ivia000_2.fastq

Illumina mate pair reads supplied as

pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped

if you have both types of reads supply them both as NANOPORE type

PACBIO=/home/jterol/PacBio/PACBIO_clem.fa

NANOPORE=/FULL_PATH/nanopore.fa

Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many

OTHER=/FULL_PATH/file.frg

END

PARAMETERS

set this to 1 if your Illumina jumping library reads are shorter than 100bp

EXTEND_JUMP_READS=0

this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content

GRAPH_KMER_SIZE = auto

set this to 1 for all Illumina-only assemblies

set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)

USE_LINKING_MATES = 0

specifies whether to run mega-reads correction on the grid

USE_GRID=0

specifies queue to use when running on the grid MANDATORY

GRID_QUEUE=all.q

batch size in the amount of long read sequence for each batch on the grid

GRID_BATCH_SIZE=300000000

use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads

LHE_COVERAGE=25

set to 1 to only do one pass of mega-reads, for faster but worse quality assembly

MEGA_READS_ONE_PASS=0

this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms

LIMIT_JUMP_COVERAGE = 300

these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.

set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.

CA_PARAMETERS = cgwErrorRate=0.15

minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100

KMER_COUNT_THRESHOLD = 1

whether to attempt to close gaps in scaffolds with Illumina data

CLOSE_GAPS=1

auto-detected number of cpus to use

NUM_THREADS = 32

this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage

JF_SIZE = 3000000000

set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data

SOAP_ASSEMBLY=0

END
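
The JF_SIZE comment in the config gives a rule of thumb: estimated genome size times estimated coverage. A minimal Python sketch of that arithmetic follows. The function name and the example numbers (a 300 Mbp genome at 10x coverage) are illustrative assumptions, not values reported in this thread.

```python
def estimate_jf_size(genome_size_bp: int, coverage: float) -> int:
    """Return estimated_genome_size * estimated_coverage, as the config comment suggests."""
    if genome_size_bp <= 0 or coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    return int(genome_size_bp * coverage)


# Hypothetical inputs: a 300 Mbp genome sequenced at 10x Illumina coverage.
print(f"JF_SIZE = {estimate_jf_size(300_000_000, 10)}")  # JF_SIZE = 3000000000
```

The result can be pasted into the JF_SIZE line of the config. Since the comment calls this a "safe value", rounding up rather than down seems the more cautious choice.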

And here the output I get when running assemble.sh:

[mar oct 2 11:53:45 CEST 2018] Processing pe library reads

awk: line ord.:1: fatal: division by zero attempted

[mar oct 2 11:53:45 CEST 2018] Average PE read length

Illegal division by zero at -e line 1.

[mar oct 2 11:53:45 CEST 2018] Using kmer size of for the graph

[mar oct 2 11:53:45 CEST 2018] MIN_Q_CHAR: 64

[mar oct 2 11:53:45 CEST 2018] Error correct PE

[mar oct 2 11:54:01 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.
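
Both division-by-zero messages appear right where the log reports the average PE read length, and the k-mer size is printed empty afterwards. That pattern would be consistent with zero reads being counted (for example, an empty or unreadable input file), although nothing in this thread confirms that cause. The sketch below is not MaSuRCA's own code. It is a hypothetical standalone check, assuming plain 4-line FASTQ records, that computes the average read length and returns None instead of dividing by zero.

```python
from typing import Iterable, Optional


def average_read_length(fastq_lines: Iterable[str]) -> Optional[float]:
    """Average sequence length of 4-line FASTQ records; None if no reads are found."""
    total = 0
    count = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record holds the bases
            total += len(line.strip())
            count += 1
    if count == 0:
        return None  # no reads: report it instead of dividing by zero
    return total / count


# Tiny in-memory example with two made-up reads of length 4 and 6.
records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGTAC", "+", "IIIIII"]
print(average_read_length(records))  # 5.0
print(average_read_length([]))       # None
```

To run it on a real file, pass an open file handle, for example `with open(path) as fh: average_read_length(fh)`, once for each of the two files named on the PE= line.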

This is what my read files look like:

[root@clemen5 PacBio]# head ivia000_1.fastq

@HWI-ST459_0069:1:1:1263:1962#0/1

GGGGGGGGAGGGGAGGAGGGGAGGGGGGGGGGGTGGGGGTGAGTGGAGGANAGGAGGGGNGNGAATGAGGAGGTAAGGGGGGAGGTTGGGTGAGGGAAGC

+HWI-ST459_0069:1:1:1263:1962#0/1

_WQX_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

@HWI-ST459_0069:1:1:1354:1977#0/1

GGAGGGGGGGGGGGGGGGGGCCGGGGGGGGGGCGGGGGGGGGGGCGAGGGNGGGGGGGGGGGGGGAGAGGTGGAGGGGGGGGGCAGGGGGTGAGGGGAGG

+HWI-ST459_0069:1:1:1354:1977#0/1

BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

[root@clemen5 PacBio]# head PACBIO_clem.fa

@m54221_171212_235526/4260368/0_10489

GTGAATGGAAAAAGGAGAATTTTCTTTCAGATATCGTACCATTCATTGAGATTTGATCTCGTCCTAACTGATAGCGATGGCCTCCCATTTTCATCCCGTTG

CTGAATAAGGACAGCTAACAAGTCCTCATCATGACATGAGCATCGTCTTGTTCTTCCTTTGTCTCCGTTGTTGTCAAACTCTCTCATCTATAATCGCATCA

TGATACTTGAGCAGTTCTCATAAGCGTCACTATAAATTTTTTTCAATGCCTTCCAAATCGAACACTCGCATCCAGGGAACATAATCGGATAGGCGAAC...
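
Two details in this output are easy to check programmatically: the file named PACBIO_clem.fa begins with an "@" header, which is the FASTQ convention rather than the ">" used by FASTA, and the Illumina quality strings use high ASCII characters. The Python sketch below is only a heuristic with made-up function names. It guesses the format from the first header character and guesses the Phred offset from the lowest quality character, assuming that characters below ";" (ASCII 59) occur only in Phred+33 data.

```python
def guess_format(first_line: str) -> str:
    """Guess the sequence file format from the first character of the first line."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    return "unknown"


def guess_phred_offset(quality: str) -> int:
    """Heuristic: quality characters below ';' (ASCII 59) only occur with offset 33."""
    lowest = min(ord(ch) for ch in quality)
    return 33 if lowest < 59 else 64


# Header of the long-read file and the start of a quality string, as posted above.
print(guess_format("@m54221_171212_235526/4260368/0_10489"))  # fastq
print(guess_phred_offset("_WQX_BBBBBBBB"))                     # 64
```

A guess of 64 from a short sample is not conclusive, because high-quality Phred+33 reads may also lack low characters. Sampling many reads gives a more reliable answer.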

Any suggestions?

Thank you very much in advance for your help.

sunnycqcn commented 5 years ago

You should try supplying the PacBio reads in FASTA format. Best,

-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA
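
One way to act on this suggestion is to convert the long-read file to FASTA. The Python sketch below is a hypothetical helper, not part of MaSuRCA. It assumes the input is FASTQ whose sequence lines may be wrapped, as the posted head output suggests, and it reads each quality block by length because quality strings can themselves start with "@" or "+". Whether the conversion resolves the reported error is not confirmed anywhere in this thread.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTQ text, tolerating wrapped lines."""
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line[:30]!r}")
        name = line[1:]
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break  # separator line: the sequence is complete
            seq_parts.append(line)
        seq = "".join(seq_parts)
        # Consume quality lines until they cover the whole sequence.
        qual_len = 0
        while qual_len < len(seq):
            qual = next(it, None)
            if qual is None:
                raise ValueError(f"truncated record: {name}")
            qual_len += len(qual.rstrip("\r\n"))
        yield name, seq


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as FASTA text, one sequence line per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Made-up example: the first record is wrapped and its quality contains '@'.
text = "@r1\nACGT\nAC\n+\nIIII\n@I\n@r2\nGG\n+r2\n@@\n"
print(to_fasta(read_fastq(text.splitlines())), end="")
```

For a real file, iterate over an open handle and write the result to a new path, then point the PACBIO= line of the config at the converted file.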

lly1214 commented 2 years ago

Dear jterol, have you solved your problem? I am trying to run MaSuRCA with only Illumina paired-end reads and ran into the same problem. Could this problem be caused by running out of memory?
