Hi!

I'm trying to run MaSuRCA with Illumina paired-end libraries and PacBio long reads.

Here is my config file:

DATA
#Illumina paired end reads supplied as
#if single-end, do not specify
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= pe 515 13 /home/jterol/PacBio/ivia000_1.fastq /home/jterol/PacBio/ivia000_2.fastq
#Illumina mate pair reads supplied as
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
PACBIO=/home/jterol/PacBio/PACBIO_clem.fa
#NANOPORE=/FULL_PATH/nanopore.fa
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
END

PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
LHE_COVERAGE=25
#set to 1 to only do one pass of mega-reads, for faster but worse quality assembly
MEGA_READS_ONE_PASS=0
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 32
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 3000000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END
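
For the JF_SIZE line, the config comment gives the rule of thumb estimated_genome_size*estimated_coverage. A minimal Python sketch of that arithmetic follows. The genome size and coverage are hypothetical placeholders (neither is stated in this thread), picked only because their product matches the 3000000000 in the config, and `jellyfish_hash_size` is an illustrative helper name, not part of MaSuRCA.

```python
def jellyfish_hash_size(genome_size_bp, coverage):
    """Rule of thumb from the config comment: genome size times coverage."""
    return int(genome_size_bp * coverage)


# Hypothetical inputs: a 300 Mbp genome at 10x Illumina coverage.
jf_size = jellyfish_hash_size(300_000_000, 10)
print(f"JF_SIZE = {jf_size}")
```

If your real genome size or coverage is larger than these placeholders, the same product tells you how far to raise JF_SIZE.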

And here is the output I get when running assemble.sh:

[mar oct 2 11:53:45 CEST 2018] Processing pe library reads
awk: line ord.:1: fatal: division by zero attempted
[mar oct 2 11:53:45 CEST 2018] Average PE read length
Illegal division by zero at -e line 1.
[mar oct 2 11:53:45 CEST 2018] Using kmer size of for the graph
[mar oct 2 11:53:45 CEST 2018] MIN_Q_CHAR: 64
[mar oct 2 11:53:45 CEST 2018] Error correct PE
[mar oct 2 11:54:01 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.
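
Both division-by-zero messages appear right after the steps that process the PE library and compute the average PE read length. One possible reading (an assumption, the log does not confirm it) is that zero reads were counted from the PE files. The sketch below is an independent sanity check, not MaSuRCA code: it counts reads and averages their length, assuming plain 4-line FASTQ records. `fastq_read_stats` is a hypothetical helper name.

```python
import gzip


def fastq_read_stats(path):
    """Return (read_count, mean_read_length) for a 4-line-per-record FASTQ file.

    Assumption: every record is exactly four lines (header, sequence, '+', quality).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    count = 0
    total_bases = 0
    with opener(path, "rt") as handle:
        for line_index, line in enumerate(handle):
            if line_index % 4 == 1:  # second line of each record is the sequence
                count += 1
                total_bases += len(line.strip())
    # Guard against the same division by zero seen in the log.
    mean_length = total_bases / count if count else 0.0
    return count, mean_length
```

Calling `fastq_read_stats` on each of the two files named in the PE= line and getting a count of 0 (or an exception because the file cannot be opened) would point at the input files rather than at the assembler.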

This is what my read files look like:

[root@clemen5 PacBio]# head ivia000_1.fastq
@HWI-ST459_0069:1:1:1263:1962#0/1
GGGGGGGGAGGGGAGGAGGGGAGGGGGGGGGGGTGGGGGTGAGTGGAGGANAGGAGGGGNGNGAATGAGGAGGTAAGGGGGGAGGTTGGGTGAGGGAAGC
+HWI-ST459_0069:1:1:1263:1962#0/1
_WQX_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST459_0069:1:1:1354:1977#0/1
GGAGGGGGGGGGGGGGGGGGCCGGGGGGGGGGCGGGGGGGGGGGCGAGGGNGGGGGGGGGGGGGGAGAGGTGGAGGGGGGGGGCAGGGGGTGAGGGGAGG
+HWI-ST459_0069:1:1:1354:1977#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[root@clemen5 PacBio]# head PACBIO_clem.fa
@m54221_171212_235526/4260368/0_10489
GTGAATGGAAAAAGGAGAATTTTCTTTCAGATATCGTACCATTCATTGAGATTTGATCTCGTCCTAACTGATAGCGATGGCCTCCCATTTTCATCCCGTTG
CTGAATAAGGACAGCTAACAAGTCCTCATCATGACATGAGCATCGTCTTGTTCTTCCTTTGTCTCCGTTGTTGTCAAACTCTCTCATCTATAATCGCATCA
TGATACTTGAGCAGTTCTCATAAGCGTCACTATAAATTTTTTTCAATGCCTTCCAAATCGAACACTCGCATCCAGGGAACATAATCGGATAGGCGAAC...
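
Two things in this output can be checked mechanically. PACBIO_clem.fa has a .fa name but its first header starts with '@', the FASTQ header character, rather than the '>' that FASTA uses. The Illumina quality strings also contain only characters such as 'B' and '_', which fits the MIN_Q_CHAR: 64 line in the log. The Python sketch below shows both checks. The function names are hypothetical and the quality-offset rule is a common heuristic, not how MaSuRCA decides.

```python
def sniff_format(first_line):
    """Guess the read format from the first character of the first header line."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    return "unknown"


def guess_phred_offset(quality_lines):
    """Heuristic guess of the quality offset from the lowest character seen.

    Characters below ';' (ASCII 59) only occur in Phred+33 data.
    A minimum of '@' (ASCII 64) or above is consistent with Phred+64,
    although very high-quality Phred+33 data could look the same.
    Returns None when the minimum falls in the ambiguous range 59..63.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None


print(sniff_format("@m54221_171212_235526/4260368/0_10489"))
print(guess_phred_offset(["_WQX_BBBBBBBB"]))
```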

Any suggestions?

Thank you very much in advance for your help

You should try supplying the PacBio reads in FASTA format. Best,
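
A minimal Python sketch of such a FASTQ-to-FASTA conversion, in case it helps. It assumes the PacBio file really is FASTQ (an '@' header, sequence lines, a '+' line, then quality lines), possibly with the sequence wrapped over several lines as the `head` output suggests. That is an assumption: the excerpt in this thread is truncated before any '+' line. `fastq_to_fasta` is a hypothetical helper, not a MaSuRCA tool.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Multi-line records are allowed: the quality block is consumed by length,
    so quality lines that happen to start with '@' are not taken for headers.
    """
    it = (ln.rstrip("\n") for ln in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header[:30]!r}")
        seq_parts = []
        for ln in it:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        else:
            raise ValueError("record has no '+' separator line")
        seq = "".join(seq_parts)
        qual_len = 0
        while qual_len < len(seq):
            try:
                qual_len += len(next(it))
            except StopIteration:
                raise ValueError("quality block shorter than sequence") from None
        yield ">" + header[1:]
        yield seq


# Tiny in-memory demonstration with made-up reads.
demo = ["@r1\n", "ACGT\n", "AC\n", "+\n", "IIII\n", "II\n"]
print("\n".join(fastq_to_fasta(demo)))
```

If the function raises "record has no '+' separator line" on the real file, the file is probably not FASTQ at all, and only the '@' header characters would need replacing with '>'.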

On Thu, Oct 4, 2018 at 3:21 AM jterol notifications@github.com wrote:



-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA

Dear jterol: have you solved your problem? I am trying to run MaSuRCA with only Illumina paired-end reads and hit the same problem. Could this problem be associated with running out of memory?

alekseyzimin / masurca

Error correction of PE reads failed. Check pe.cor.log. #70
