alekseyzimin / masurca

Error: "reading mega-reads file" / Masurca/3.3.4 #158

Open Menomens opened 4 years ago

Menomens commented 4 years ago

Hi, I'm trying to do a de novo assembly of a fungal genome that has an average size of 53 Mbp. I keep getting this error:

[Wed Jan 29 14:52:12 CET 2020] Processing pe library reads
[Wed Jan 29 15:20:21 CET 2020] Average PE read length 151
[Wed Jan 29 15:20:21 CET 2020] Using kmer size of 99 for the graph
[Wed Jan 29 15:20:22 CET 2020] MIN_Q_CHAR: 33
[Wed Jan 29 15:20:22 CET 2020] Creating mer database for Quorum
[Wed Jan 29 18:46:45 CET 2020] Error correct PE
[Thu Jan 30 14:53:45 CET 2020] Estimating genome size
[Thu Jan 30 17:39:19 CET 2020] Estimated genome size: 231108880
[Thu Jan 30 17:39:19 CET 2020] Creating k-unitigs with k=99
[Thu Jan 30 23:54:41 CET 2020] Computing super reads from PE 
[Fri Jan 31 03:27:28 CET 2020] Using CABOG from /cm/shared/apps/masurca/3.3.4/bin/../CA8/Linux-amd64/bin
[Fri Jan 31 03:27:28 CET 2020] Running mega-reads correction/assembly
[Fri Jan 31 03:27:28 CET 2020] Using mer size 15 for mapping, B=17, d=0.029
[Fri Jan 31 03:27:28 CET 2020] Estimated Genome Size 231108880
[Fri Jan 31 03:27:28 CET 2020] Estimated Ploidy 1
[Fri Jan 31 03:27:28 CET 2020] Using 28 threads
[Fri Jan 31 03:27:28 CET 2020] Output prefix mr.41.15.17.0.029
[Fri Jan 31 03:27:28 CET 2020] Pacbio coverage <25x, using the longest subreads
[Fri Jan 31 03:27:45 CET 2020] Reducing super-read k-mer size
[Fri Jan 31 04:20:09 CET 2020] Mega-reads pass 1
[Fri Jan 31 04:20:09 CET 2020] Running locally in 1 batch
[Fri Jan 31 16:03:23 CET 2020] Mega-reads pass 2
[Fri Jan 31 16:03:23 CET 2020] Running locally in 1 batch
[Fri Jan 31 19:57:22 CET 2020] Refining alignments
[Fri Jan 31 20:17:30 CET 2020] Computing allowed merges
[Fri Jan 31 20:17:36 CET 2020] Joining
[Fri Jan 31 20:18:09 CET 2020] Gap consensus
[Fri Jan 31 20:18:31 CET 2020] Generating assembly input files
error reading mega-reads file at /cm/shared/apps/masurca/3.3.4/bin/find_contained_reads.pl line 33, <FILE> line 339401.
[Fri Jan 31 20:36:21 CET 2020] failed to create mega-reads frg file
[Fri Jan 31 20:36:21 CET 2020] mega-reads exited before assembly

This is the sr_config

DATA  
PE= pe 151 1 /home/cgianchino/data/masurcarun/illr1.fastq  /home/cgianchino/data/masurcarun/illr2.fastq
PACBIO=/home/cgianchino/data/pbio/pbdef.fasta
END

PARAMETERS
EXTEND_JUMP_READS=0
GRAPH_KMER_SIZE=auto
USE_LINKING_MATES=0
USE_GRID=0
GRID_ENGINE=SLURM
LHE_COVERAGE=25
MEGA_READS_ONE_PASS=0
CA_PARAMETERS= cgwErrorRate=0.1
CLOSE_GAPS=1
NUM_THREADS=28
JF_SIZE=5000000000
SOAP_ASSEMBLY=0
FLYE_ASSEMBLY=0
END

What can I do? Is this a problem with the old version, or am I doing something wrong?

alekseyzimin commented 4 years ago

Please re-run with the latest version and let me know if you are still having this problem.

MichaelFokinNZ commented 10 months ago

Hi, I am getting exactly the same error with two versions, 4.0.9 and 4.1.0, on the CentOS 7 NeSI cluster (I am also seeking help from engineering support there).

error reading mega-reads file at /scale_wlg_persistent/filesets/opt_nesi/CS400_centos7_bdw/MaSuRCA/4.0.9-gimkl-2020a/bin/find_contained_reads.pl line 33, <FILE> line 230791.
[Wed Sep 27 19:34:23 UTC 2023] failed to create mega-reads frg file

The genome is haploid, about 42 Mb (MaSuRCA estimates 53 Mb). The Illumina and ONT libraries are both high coverage (>100x) and worked well with SPAdes and Flye-Pilon.

Are there any other files or output I should provide?

std_out

Verifying PATHS...
jellyfish OK
runCA OK
createSuperReadsForDirectory.perl OK
creating script file for the actions...done.
execute assemble.sh to run assembly
[Wed Sep 27 18:53:10 UTC 2023] Processing pe library reads
[Wed Sep 27 18:54:51 UTC 2023] Average PE read length 146
[Wed Sep 27 18:54:52 UTC 2023] Using kmer size of 99 for the graph
[Wed Sep 27 18:54:52 UTC 2023] MIN_Q_CHAR: 33
WARNING: JF_SIZE set too low, increasing JF_SIZE to at least 469081568, this automatic increase may be not enough!
[Wed Sep 27 18:54:52 UTC 2023] Creating mer database for Quorum
[Wed Sep 27 18:57:23 UTC 2023] Error correct PE
[Wed Sep 27 19:04:24 UTC 2023] Estimating genome size
[Wed Sep 27 19:05:35 UTC 2023] Estimated genome size: 52710869
[Wed Sep 27 19:05:35 UTC 2023] Creating k-unitigs with k=99
[Wed Sep 27 19:09:14 UTC 2023] Computing super reads from PE 
[Wed Sep 27 19:16:14 UTC 2023] Using CABOG from /scale_wlg_persistent/filesets/opt_nesi/CS400_centos7_bdw/MaSuRCA/4.0.9-gimkl-2020a/bin
[Wed Sep 27 19:16:14 UTC 2023] Running mega-reads correction/assembly
[Wed Sep 27 19:16:14 UTC 2023] Using mer size 17 for mapping, B=15, d=0.02
[Wed Sep 27 19:16:14 UTC 2023] Estimated Genome Size 52710869
[Wed Sep 27 19:16:14 UTC 2023] Estimated Ploidy 1
[Wed Sep 27 19:16:14 UTC 2023] Using 70 threads
[Wed Sep 27 19:16:14 UTC 2023] Output prefix mr.99.17.15.0.02
[Wed Sep 27 19:16:14 UTC 2023] Creating k-unitigs for k=19
[Wed Sep 27 19:17:32 UTC 2023] Pre-correcting long reads
[Wed Sep 27 19:27:57 UTC 2023] Pre-corrected reads are in longest_reads.25x.fa
[Wed Sep 27 19:27:59 UTC 2023] Computing mega-reads
[Wed Sep 27 19:27:59 UTC 2023] Running locally in 1 batch
[Wed Sep 27 19:30:42 UTC 2023] Refining alignments
[Wed Sep 27 19:31:57 UTC 2023] Computing allowed merges
[Wed Sep 27 19:32:03 UTC 2023] Joining
[Wed Sep 27 19:32:32 UTC 2023] Gap consensus
[Wed Sep 27 19:32:34 UTC 2023] Warning! Some or all gap consensus jobs failed, see files in mr.99.17.15.0.02.join_consensus.tmp, however this is fine and assembly can proceed normally
[Wed Sep 27 19:32:35 UTC 2023] Generating assembly input files
[Wed Sep 27 19:34:23 UTC 2023] mega-reads exited before assembly

config.txt

DATA
PE = pe 500 50 /PATH/R1_001.fastq /PATH/R2_001.fastq
NANOPORE = /PATH/Nanopore/barcode53.fastq
END
PARAMETERS
EXTEND_JUMP_READS=0
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
USE_GRID=0
GRID_ENGINE=SGE
GRID_QUEUE=all.q
GRID_BATCH_SIZE=500000000
LHE_COVERAGE=25
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS =  cgwErrorRate=0.15
CLOSE_GAPS=1
NUM_THREADS = 32
JF_SIZE = 200000000
SOAP_ASSEMBLY=0
FLYE_ASSEMBLY=0
END

alekseyzimin commented 10 months ago

Hello,

Thank you for reporting this. Can you please post the output of the "ls -lth" command run in the assembly folder?

Thanks, Aleksey

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com

MichaelFokinNZ commented 10 months ago

Please find the contents of the output folder below. Three directories are still there.

-rw-rw----+ 1 username project    0 Sep 27 09:32 containees.txt
drwxrws---+ 2 username project 4.0K Sep 27 09:32 work1_mr1
-rw-rw----+ 1 username project    0 Sep 27 09:32 reduce2.out
-rw-rw----+ 1 username project  14K Sep 27 09:31 super1.err
-rw-rw----+ 1 username project  60M Sep 27 09:31 guillaumeKUnitigsAtLeast32bases_all.31.fasta
-rw-rw----+ 1 username project 1.2G Sep 27 09:30 mr.fa.in
-rw-rw----+ 1 username project 1.2G Sep 27 09:29 mr.99.17.15.0.02.1.fa
drwxrws---+ 2 username project 4.0K Sep 27 09:29 mr.99.17.15.0.02.join_consensus.tmp
-rw-rw----+ 1 username project 128M Sep 27 09:29 mr.99.17.15.0.02.1.to_join.fa
-rw-rw----+ 1 username project 1.1G Sep 27 09:29 mr.99.17.15.0.02.1.unjoined.fa
-rw-rw----+ 1 username project 4.1M Sep 27 09:29 mr.99.17.15.0.02.1.allowed
-rw-rw----+ 1 username project 1.6G Sep 27 09:29 mr.99.17.15.0.02.all.txt
-rw-rw----+ 1 username project 1.6G Sep 27 09:26 mr.99.17.15.0.02.txt
-rw-rw----+ 1 username project   29 Sep 27 09:22 create_mega-reads.err
-rw-rw----+ 1 username project 199M Sep 27 09:22 superReadSequences.named.fasta
-rw-rw----+ 1 username project 1.3G Sep 27 09:22 longest_reads.25x.fa
-rw-rw----+ 1 username project   20 Sep 27 08:55 CA_dir.txt
-rw-rw----+ 1 username project    2 Sep 27 08:55 PLOIDY.txt
drwxrws---+ 2 username project 4.0K Sep 27 08:55 work1
lrwxrwxrwx  1 username project   41 Sep 27 08:45 guillaumeKUnitigsAtLeast32bases_all.jump.fasta -> guillaumeKUnitigsAtLeast32bases_all.fasta
-rw-rw----+ 1 username project 168M Sep 27 08:45 guillaumeKUnitigsAtLeast32bases_all.fasta
-rw-rw----+ 1 username project 1.6K Sep 27 08:39 environment.sh
-rw-rw----+ 1 username project    9 Sep 27 08:39 ESTIMATED_GENOME_SIZE.txt
-rw-rw----+ 1 username project 759M Sep 27 08:39 k_u_hash_0
-rw-rw----+ 1 username project  433 Sep 27 08:37 quorum.err
-rw-rw----+ 1 username project  12G Sep 27 08:37 pe.cor.fa
-rw-rw----+ 1 username project 5.2M Sep 27 08:37 pe.cor.tmp.log
-rw-rw----+ 1 username project 2.3G Sep 27 08:20 quorum_mer_db.jf
-rw-rw----+ 1 username project 2.9M Sep 27 08:17 pe_data.tmp
-rw-rw----+ 1 username project  22G Sep 27 08:17 pe.renamed.fastq
-rw-rw----+ 1 username project   10 Sep 27 08:15 meanAndStdevByPrefix.pe.txt
-rwxr-xr-x+ 1 username project 9.1K Sep 27 08:15 assemble.sh
-rwxr-xr-x+ 1 username project  137 Sep 27 08:15 run.sh
-rw-rw----+ 1 username project  639 Sep 27 08:15 config.txt

CPU/RAM usage for the process above (screenshot not preserved).

I also tried with more resources and got the same result (screenshot not preserved).
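In the listing above, containees.txt and reduce2.out are 0 bytes and create_mega-reads.err is only 29 bytes, so empty outputs and small error logs are a natural first place to look. A minimal sketch for surfacing them, assuming bash and a `find` that supports `-maxdepth` (GNU or BSD); the function name `list_suspect_files` is made up for illustration and is not part of MaSuRCA:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MaSuRCA): list zero-byte files and
# non-empty *.err files found directly inside an assembly directory.
list_suspect_files() {
  local dir="${1:-.}"
  # Zero-byte outputs often mean an upstream step wrote nothing.
  find "$dir" -maxdepth 1 -type f -size 0 -exec basename {} \; \
    | sort | sed 's/^/EMPTY /'
  # Non-empty .err files may hold the underlying error message.
  find "$dir" -maxdepth 1 -type f -name '*.err' ! -size 0 -exec basename {} \; \
    | sort | sed 's/^/ERR /'
}
```

Running `list_suspect_files /path/to/assembly_dir` and then reading each file reported as `ERR` may show a message that never reached the main log.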

MichaelFokinNZ commented 10 months ago

Update: I tried a different dataset (also PE Illumina + Nanopore) and got the same error.

MichaelFokinNZ commented 10 months ago

Are these related issues? https://github.com/alekseyzimin/masurca/issues/313 https://github.com/alekseyzimin/masurca/issues/239

alekseyzimin commented 10 months ago

I have just downloaded the MaSuRCA-4.1.0 release and successfully ran an assembly with Illumina PE and Nanopore data (FLYE_ASSEMBLY=0). For ONT+Illumina data sets I recommend setting FLYE_ASSEMBLY=1 in the config file. Please make sure you are running MaSuRCA in a clean environment and that there are no conflicts with bioconda packages or Perl libraries on the PATH.

alekseyzimin commented 10 months ago

Most likely you have a failure in the MUMmer perl binding. This is a known conflict with bioconda mummer install. MaSuRCA installs and compiles easily from the distribution tarball, so you can just install it in a local clean environment and run.
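To check whether a conda-provided Perl or MUMmer is being picked up ahead of MaSuRCA's own, it can help to list the conda directories on PATH and see which binaries resolve first. A rough sketch, assuming bash; the function names and the "conda" substring match are illustrative guesses, not something MaSuRCA provides:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each entry of a PATH-like string that contains
# a given substring (default "conda"), one entry per line.
path_entries_matching() {
  local path_value="$1" needle="${2:-conda}" entry
  local -a entries
  IFS=':' read -r -a entries <<< "$path_value"
  for entry in "${entries[@]}"; do
    case "$entry" in
      *"$needle"*) printf '%s\n' "$entry" ;;
    esac
  done
}

# Summarise what the current shell would expose to MaSuRCA's Perl scripts.
report_environment() {
  echo "conda entries on PATH:"
  path_entries_matching "$PATH" conda
  # PERL5LIB adds directories to Perl's module search path.
  echo "PERL5LIB=${PERL5LIB:-<unset>}"
  echo "perl resolves to: $(command -v perl || echo 'not found')"
  echo "nucmer resolves to: $(command -v nucmer || echo 'not found')"
}
```

If `report_environment` shows `perl` or `nucmer` resolving inside a conda directory, that would be consistent with the conflict described above, though it does not prove it.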

MichaelFokinNZ commented 10 months ago

Thank you! I have a gut feeling that it might be related to the high-coverage (>100x) ONT datasets (both of the ones I tried). Can you recommend any available, reproducible SRA dataset (Illumina+ONT) to try?

alekseyzimin commented 10 months ago

MaSuRCA should run just fine on up to 150x Illumina PE coverage. I do not recommend using more than that: the assembly will run, but the results will be worse. I have just uploaded a data set for a 37 Mbp fungal genome to our anonymous FTP. These data are public, but I do not remember the SRA IDs:
ftp://ftp.ccb.jhu.edu/pub/alekseyz/L.prolificans/lprol.tgz
The assembly runs in about 30 minutes on a 24-core server. MaSuRCA 4.1.0 yields a ~37 Mbp assembly in ~40 contigs with a contig N50 of about 2 Mbp.
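To see where a read set sits relative to the 150x guideline above, coverage can be estimated as total sequenced bases divided by genome size. A minimal sketch, assuming uncompressed FASTQ files with strict four-line records and a standard `awk`; the function names are illustrative:

```shell
#!/usr/bin/env bash
# Count sequence bases in one or more uncompressed FASTQ files,
# assuming strict 4-line records (the sequence is line 2 of each record).
fastq_total_bases() {
  awk 'FNR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$@"
}

# Coverage = total bases / genome size, printed with one decimal place.
estimate_coverage() {
  local genome_size="$1"
  shift
  local bases
  bases=$(fastq_total_bases "$@")
  awk -v b="$bases" -v g="$genome_size" 'BEGIN { printf "%.1f\n", b / g }'
}
```

For the run reported earlier in this thread, that would be something like `estimate_coverage 52710869 R1_001.fastq R2_001.fastq`, using the genome size MaSuRCA estimated in the log.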

MichaelFokinNZ commented 10 months ago

Hi Aleksey, with help from NeSI's (NZ) cluster support we figured out the cause of this, and I ran your dataset successfully.

This is a common Perl issue with SSH sessions opened from Windows clients (or some Mac clients); WSL1 in my case. I got the following error:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

That was fixed permanently by adding the following lines to my ~/.bashrc on the server side. I assume they could be part of the run script as well.

# To get rid of the Perl locale warning.
export LANGUAGE=en_NZ.UTF-8
export LC_ALL=en_NZ.UTF-8
export LANG=en_NZ.UTF-8
export LC_CTYPE=en_NZ.UTF-8

I see this issue popping up in a few threads, so the cause and solution might be useful (it is likely obvious to Perl experts).
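The `en_NZ.UTF-8` value above is specific to that cluster. A more portable sketch, assuming `locale -a` is available on the server, uses the requested locale only if the system reports it as installed and otherwise falls back to `C`; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read available locales (one per line) on stdin and
# echo the wanted locale if it is among them, otherwise echo "C".
# The comparison ignores case and "-", because `locale -a` commonly prints
# names such as "en_NZ.utf8" for "en_NZ.UTF-8".
choose_locale() {
  local wanted="$1" norm
  norm=$(printf '%s' "$wanted" | tr '[:upper:]' '[:lower:]' | tr -d '-')
  if tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -Fxq -- "$norm"; then
    printf '%s\n' "$wanted"
  else
    printf 'C\n'
  fi
}

# Export a locale that is actually installed before launching the assembly.
set_safe_locale() {
  local chosen
  chosen=$(locale -a 2>/dev/null | choose_locale "${1:-en_US.UTF-8}")
  export LC_ALL="$chosen" LANG="$chosen"
}
```

Calling `set_safe_locale en_NZ.UTF-8` near the top of the run script, before `./assemble.sh`, would then avoid exporting a locale the server does not have.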