alekseyzimin / masurca

GNU General Public License v3.0

Test Illumina+Pacbio data set for MaSuRCA question #134

Open tingyanchang opened 4 years ago

tingyanchang commented 4 years ago

I downloaded all the FTP files from the article "Test Illumina+Pacbio data set for MaSuRCA" on the "MaSuRCA genome assembly package" website. After running assemble.sh, I got this error:

[rom1025tyc@clogin4 yeast]$ masurca config.txt
Verifying PATHS...
jellyfish OK
runCA OK
createSuperReadsForDirectory.perl OK
nucmer OK
mega_reads_assemble_cluster.sh OK
creating script file for the actions...done.
execute assemble.sh to run assembly
[rom1025tyc@clogin4 yeast]$ ./assemble.sh
[Mon Oct 7 13:12:33 CST 2019] Processing pe library reads
[Mon Oct 7 13:12:39 CST 2019] Average PE read length 250
[Mon Oct 7 13:12:39 CST 2019] Using kmer size of 127 for the graph
[Mon Oct 7 13:12:39 CST 2019] MIN_Q_CHAR: 33
WARNING: JF_SIZE set too low, increasing JF_SIZE to at least 187720586, this automatic increase may be not enough!
[Mon Oct 7 13:12:39 CST 2019] Estimated genome size: 13813470
[Mon Oct 7 13:12:39 CST 2019] Computing super reads from PE
[Mon Oct 7 13:12:39 CST 2019] Using CABOG from /home/u7/rom1025tyc/.conda/envs/prj0036/bin/../CA8/Linux-amd64/bin
[Mon Oct 7 13:12:39 CST 2019] Running mega-reads correction/assembly
[Mon Oct 7 13:12:39 CST 2019] Using mer size 15 for mapping, B=17, d=0.029
[Mon Oct 7 13:12:39 CST 2019] Estimated Genome Size 13813470
[Mon Oct 7 13:12:39 CST 2019] Estimated Ploidy 2
[Mon Oct 7 13:12:39 CST 2019] Using 32 threads
[Mon Oct 7 13:12:39 CST 2019] Output prefix mr.41.15.17.0.029
[Mon Oct 7 13:12:39 CST 2019] Pacbio coverage <25x, using the longest subreads
[Mon Oct 7 13:12:39 CST 2019] K-unitigs file guillaumeKUnitigsAtLeast32bases_all.41.fasta not found!
[Mon Oct 7 13:12:39 CST 2019] mega-reads exited before assembly

Is there any setting I should use in the config file?

alekseyzimin commented 4 years ago

Please post output of "ls -lth" on the assembly folder.

mahalel commented 4 years ago

Hi there, I have just tried a test run and I am getting the same issue:

[Thu Mar 12 02:26:29 UTC 2020] Processing pe library reads
[Thu Mar 12 02:27:09 UTC 2020] Average PE read length 250
[Thu Mar 12 02:27:09 UTC 2020] Using kmer size of 127 for the graph
[Thu Mar 12 02:27:09 UTC 2020] MIN_Q_CHAR: 33
[Thu Mar 12 02:27:09 UTC 2020] Creating mer database for Quorum
[Thu Mar 12 02:28:48 UTC 2020] Error correct PE
[Thu Mar 12 02:42:07 UTC 2020] Estimating genome size
[Thu Mar 12 02:43:50 UTC 2020] Estimated genome size: 13813470
[Thu Mar 12 02:43:50 UTC 2020] Creating k-unitigs with k=127
[Thu Mar 12 02:51:44 UTC 2020] Computing super reads from PE
[Thu Mar 12 02:58:54 UTC 2020] Using CABOG from /opt/masurca/bin/../CA8/Linux-amd64/bin
[Thu Mar 12 02:58:54 UTC 2020] Running mega-reads correction/assembly
[Thu Mar 12 02:58:54 UTC 2020] Using mer size 15 for mapping, B=17, d=0.029
[Thu Mar 12 02:58:54 UTC 2020] Estimated Genome Size 13813470
[Thu Mar 12 02:58:54 UTC 2020] Estimated Ploidy 2
[Thu Mar 12 02:58:54 UTC 2020] Using 32 threads
[Thu Mar 12 02:58:54 UTC 2020] Output prefix mr.41.15.17.0.029
[Thu Mar 12 02:58:54 UTC 2020] Pacbio coverage <25x, using the longest subreads
[Thu Mar 12 02:58:59 UTC 2020] K-unitigs file guillaumeKUnitigsAtLeast32bases_all.41.fasta not found!
[Thu Mar 12 02:58:59 UTC 2020] mega-reads exited before assembly

ls -latrh output:

-rw-r--r--. 1 slurm cyclecloud 2.5G Mar 11 22:54 Illumina_500bp_2x300_R1.part1.fastq
-rw-r--r--. 1 slurm cyclecloud 2.5G Mar 11 22:55 Illumina_500bp_2x300_R2.part1.fastq
-rw-r--r--. 1 slurm cyclecloud 487M Mar 11 22:55 Pacbio.40x.fasta
-rw-r--r--. 1 slurm cyclecloud 44 Mar 11 22:55 masurca.sh
-rw-r--r--. 1 slurm cyclecloud 47M Mar 11 22:58 superReadSequences.named.fasta
-rw-r--r--. 1 slurm cyclecloud 381 Mar 12 02:08 config.txt
-rwxr-xr-x. 1 root root 8.6K Mar 12 02:12 assemble.sh
-rw-r--r--. 1 root root 419 Mar 12 02:26 slurm-21.out
-rw-r--r--. 1 root root 10 Mar 12 02:26 meanAndStdevByPrefix.pe.txt
-rw-r--r--. 1 root root 3.9G Mar 12 02:27 p1.renamed.fastq
-rw-r--r--. 1 root root 4.8M Mar 12 02:27 pe_data.tmp
-rw-r--r--. 1 root root 1.2G Mar 12 02:28 quorum_mer_db.jf
-rw-r--r--. 1 root root 2.1M Mar 12 02:42 pe.cor.tmp.log
-rw-r--r--. 1 root root 2.0G Mar 12 02:42 pe.cor.fa
-rw-r--r--. 1 root root 431 Mar 12 02:42 quorum.err
-rw-r--r--. 1 root root 181M Mar 12 02:43 k_u_hash_0
-rw-r--r--. 1 root root 9 Mar 12 02:43 ESTIMATED_GENOME_SIZE.txt
-rw-r--r--. 1 root root 604 Mar 12 02:43 environment.sh
-rw-r--r--. 1 root root 41M Mar 12 02:51 guillaumeKUnitigsAtLeast32bases_all.fasta
lrwxrwxrwx. 1 root root 41 Mar 12 02:51 guillaumeKUnitigsAtLeast32bases_all.jump.fasta -> guillaumeKUnitigsAtLeast32bases_all.fasta
-rw-r--r--. 1 root root 6.5M Mar 12 02:58 super1.err
drwxr-xr-x. 2 root root 4.0K Mar 12 02:58 work1
-rw-r--r--. 1 root root 2 Mar 12 02:58 PLOIDY.txt
-rw-r--r--. 1 root root 21 Mar 12 02:58 CA_dir.txt
-rw-r--r--. 1 root root 367M Mar 12 02:58 pacbio_nonredundant.fa
drwxr-xr-x. 3 slurm cyclecloud 4.0K Mar 12 02:58 .

Config file (I am planning to use the SLURM scheduler):

[root@ip-0A780A07 masurca-test]# cat config.txt
PARAMETERS
USE_LINKING_MATES=0
MEGA_READS_ONE_PASS=0
CA_PARAMETERS = cnsMinFrags=200 cgwErrorRate=0.25 ovlHashBlockLength=1000000 ovlRefBlockSize=100000
NUM_THREADS=32
JF_SIZE=200000000
USE_GRID=1
GRID_QUEUE=htc
GRID_ENGINE=SLURM
GRID_BATCH_SIZE=500000000
END

DATA
PE= p1 500 50 Illumina_500bp_2x300_R1.part1.fastq Illumina_500bp_2x300_R2.part1.fastq
PACBIO=Pacbio.40x.fasta
END

Using MaSuRCA-3.3.8b.tar.gz compiled version

alekseyzimin commented 4 years ago

Hi

You can remove the superReadSequences.named.fasta file, re-generate assemble.sh, and re-run it. See if you get any errors. The file guillaumeKUnitigsAtLeast32bases_all.41.fasta is created before superReadSequences.named.fasta.
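
The cleanup step could be sketched in shell as below. This is only an illustration under stated assumptions: `clean_stale_super_reads` is a hypothetical helper, not part of MaSuRCA, and it assumes the two file names from the log and directory listing in this thread. It deletes the super-reads file only when the k-unitigs file named in the error message is absent, which appears to be the situation in the listing above, where superReadSequences.named.fasta is dated several hours before the run.

```shell
#!/bin/sh
# Hypothetical helper (not part of MaSuRCA): remove a leftover super-reads
# file when the k-unitigs file named in the error message is missing.
# The argument is the assembly folder; it defaults to the current directory.
clean_stale_super_reads() {
    dir="${1:-.}"
    # File names are taken from the log and listing in this thread.
    kunitigs="$dir/guillaumeKUnitigsAtLeast32bases_all.41.fasta"
    super_reads="$dir/superReadSequences.named.fasta"

    if [ ! -e "$kunitigs" ] && [ -e "$super_reads" ]; then
        rm -- "$super_reads"
        echo "removed $super_reads"
    else
        echo "nothing to remove in $dir"
    fi
}
```

After calling `clean_stale_super_reads .` in the assembly folder, the remaining steps are the two commands already shown in the original post: `masurca config.txt` to re-generate assemble.sh, then `./assemble.sh`.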

--Aleksey

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com

alekseyzimin commented 4 years ago

Hi,

I just published release 3.3.9, which should automatically fix your error.

Please check version 3.3.9 on the releases page.

--Aleksey

mahalel commented 4 years ago

Hi Aleksey, appreciate it, that seems to work now.