gbouras13 / plassembler

Program to quickly and accurately assemble plasmids in hybrid and long-only sequenced bacterial isolates
MIT License
Plassembler stops at samtools view step #2

Closed gaworj closed 1 year ago

gaworj commented 1 year ago

Hello,

Thanks for a very useful tool.

I have successfully installed plassembler, but in my case the pipeline does not finish as expected.

When I try to run:

plassembler.py -d /home/data_HDD2/plassembler_db/ -l 4-LPC100_ont_1kb_q12.fastq.gz -1 4-LPC100_trim_R1.fastq.gz -2 4-LPC100_trim_R2.fastq.gz -c 3100000 --threads 40 -o 4-LPC100_plassembler -p 4-LPC100

Starting plassembler v0.1.4

Checking dependencies.

Flye version found is v2.9.1-b1780.

Flye version is ok.

Unicycler version found is v0.5.0.

Unicycler version is ok.

Checking database installation.

Database successfully checked.

Checking input fastqs.

FASTQ 4-LPC100_ont_1kb_q12.fastq.gz checked

FASTQ 4-LPC100_trim_R1.fastq.gz checked

FASTQ 4-LPC100_trim_R2.fastq.gz checked

Filtering long reads.

Running Flye.

Counting Contigs.

Flye assembled 3 contigs.

More than one contig was assembled with Flye.

Extracting Chromosome.

Chromosome Identified. Plassembler will now use long and short reads to assemble plasmids accurately.

Trimming short reads.

Mapping Long Reads to Putative Plasmid Contigs.

Mapping Long Reads to Chromosome.

Mapping Short Reads to Putative Plasmid Contigs

Mapping Short Reads to Chromosome Contig

Processing Bams.

Error with samtools view.

Here is the log file output:

2023-01-14 22:29:28,120 - INFO - Starting plassembler v0.1.4

2023-01-14 22:29:28,120 - INFO - Input args: Namespace(database='/home/data_HDD2/plassembler_db/', longreads='4-LPC100_ont_1kb_q12.fastq.gz', short_one='4-LPC100_trim_R1.fastq.gz', short_two='4-LPC100_trim_R2.fastq.gz', chromosome='3100000', outdir='4-LPC100_plassembler', min_length='500', threads='40', force=False, raw_flag=False, prefix='4-LPC100', min_quality='9')

2023-01-14 22:29:28,120 - INFO - Checking dependencies.

2023-01-14 22:29:28,200 - INFO - Flye version found is v2.9.1-b1780.

2023-01-14 22:29:28,200 - INFO - Flye version is ok.

2023-01-14 22:29:28,260 - INFO - Unicycler version found is v0.5.0.

2023-01-14 22:29:28,260 - INFO - Unicycler version is ok.

2023-01-14 22:29:28,260 - INFO - Checking database installation.

2023-01-14 22:29:28,260 - INFO - Database successfully checked.

2023-01-14 22:29:28,260 - INFO - Checking input fastqs

2023-01-14 22:29:28,266 - INFO - Filtering long reads.

2023-01-14 22:31:45,781 - INFO - Running Flye

2023-01-14 22:54:14,877 - INFO - Counting Contigs

2023-01-14 22:54:14,880 - INFO - More than one contig was assembled with Flye.

2023-01-14 22:54:14,880 - INFO - Extracting Chromosome.

2023-01-14 22:54:14,942 - INFO - Chromosome Identified. Plassembler will now use both long and short reads to assemble plasmids accurately.

2023-01-14 22:54:14,942 - INFO - Trimming short reads.

2023-01-14 22:54:21,360 - INFO - Read1 before filtering:

2023-01-14 22:54:21,360 - INFO - total reads: 485740

2023-01-14 22:54:21,360 - INFO - total bases: 104249485

2023-01-14 22:54:21,360 - INFO - Q20 bases: 102792034(98.602%)

2023-01-14 22:54:21,360 - INFO - Q30 bases: 98751562(94.7262%)

2023-01-14 22:54:21,360 - INFO -

2023-01-14 22:54:21,360 - INFO - Read2 before filtering:

2023-01-14 22:54:21,360 - INFO - total reads: 485740

2023-01-14 22:54:21,360 - INFO - total bases: 98536380

2023-01-14 22:54:21,360 - INFO - Q20 bases: 92691283(94.0681%)

2023-01-14 22:54:21,360 - INFO - Q30 bases: 83616899(84.8589%)

2023-01-14 22:54:21,360 - INFO -

2023-01-14 22:54:21,360 - INFO - Read1 after filtering:

2023-01-14 22:54:21,360 - INFO - total reads: 485739

2023-01-14 22:54:21,361 - INFO - total bases: 104249194

2023-01-14 22:54:21,361 - INFO - Q20 bases: 102791772(98.602%)

2023-01-14 22:54:21,361 - INFO - Q30 bases: 98751353(94.7263%)

2023-01-14 22:54:21,361 - INFO -

2023-01-14 22:54:21,361 - INFO - Read2 after filtering:

2023-01-14 22:54:21,361 - INFO - total reads: 485739

2023-01-14 22:54:21,361 - INFO - total bases: 98535962

2023-01-14 22:54:21,361 - INFO - Q20 bases: 92690991(94.0682%)

2023-01-14 22:54:21,361 - INFO - Q30 bases: 83616711(84.8591%)

2023-01-14 22:54:21,361 - INFO -

2023-01-14 22:54:21,361 - INFO - Filtering result:

2023-01-14 22:54:21,361 - INFO - reads passed filter: 971478

2023-01-14 22:54:21,361 - INFO - reads failed due to low quality: 2

2023-01-14 22:54:21,361 - INFO - reads failed due to too many N: 0

2023-01-14 22:54:21,361 - INFO - reads failed due to too short: 0

2023-01-14 22:54:21,361 - INFO - reads with adapter trimmed: 54

2023-01-14 22:54:21,361 - INFO - bases trimmed due to adapters: 555

2023-01-14 22:54:21,361 - INFO -

2023-01-14 22:54:21,361 - INFO - Duplication rate: 0.0430271%

2023-01-14 22:54:21,361 - INFO -

2023-01-14 22:54:21,361 - INFO - Insert size peak (evaluated by paired-end reads): 152

2023-01-14 22:54:21,463 - INFO -

2023-01-14 22:54:21,463 - INFO - JSON report: fastp.json

2023-01-14 22:54:21,463 - INFO - HTML report: fastp.html

2023-01-14 22:54:21,463 - INFO -

2023-01-14 22:54:21,463 - INFO - fastp --in1 4-LPC100_trim_R1.fastq.gz --in2 4-LPC100_trim_R2.fastq.gz --out1 4-LPC100_plassembler/trimmed_R1.fastq --out2 4-LPC100_plassembler/trimmed_R2.fastq

2023-01-14 22:54:21,463 - INFO - fastp v0.23.2, time used: 7 seconds

2023-01-14 22:54:22,473 - INFO - Mapping Long Reads to Putative Plasmid Contigs.

2023-01-14 22:54:22,478 - INFO - [M::mm_idx_gen::0.004*1.34] collected minimizers

2023-01-14 22:54:22,480 - INFO - [M::mm_idx_gen::0.006*3.85] sorted minimizers

2023-01-14 22:54:22,480 - INFO - [M::main::0.006*3.83] loaded/built the index for 2 target sequence(s)

2023-01-14 22:54:22,481 - INFO - [M::mm_mapopt_update::0.007*3.56] mid_occ = 10

2023-01-14 22:54:22,481 - INFO - [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2

2023-01-14 22:54:22,481 - INFO - [M::mm_idx_stat::0.007*3.37] distinct minimizers: 13345 (81.22% are singletons); average occurrences: 1.199; average spacing: 5.321; total length: 85103

2023-01-14 22:54:35,391 - INFO - [M::worker_pipeline::12.917*21.07] mapped 28247 sequences

2023-01-14 22:54:35,397 - INFO - [M::main] Version: 2.24-r1122

2023-01-14 22:54:35,397 - INFO - [M::main] CMD: minimap2 -ax map-ont -t 40 4-LPC100_plassembler/non_chromosome.fasta 4-LPC100_plassembler/filtered_long_reads.fastq.gz

2023-01-14 22:54:35,397 - INFO - [M::main] Real time: 12.923 sec; CPU: 272.115 sec; Peak RSS: 6.733 GB

2023-01-14 22:54:35,438 - INFO - Mapping Long Reads to Chromosome.

2023-01-14 22:54:35,543 - INFO - [M::mm_idx_gen::0.103*1.01] collected minimizers

2023-01-14 22:54:35,553 - INFO - [M::mm_idx_gen::0.114*2.34] sorted minimizers

2023-01-14 22:54:35,553 - INFO - [M::main::0.114*2.34] loaded/built the index for 1 target sequence(s)

2023-01-14 22:54:35,563 - INFO - [M::mm_mapopt_update::0.124*2.23] mid_occ = 14

2023-01-14 22:54:35,563 - INFO - [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1

2023-01-14 22:54:35,569 - INFO - [M::mm_idx_stat::0.129*2.18] distinct minimizers: 552383 (98.32% are singletons); average occurrences: 1.042; average spacing: 5.341; total length: 3075531

2023-01-14 22:54:53,428 - INFO - [M::worker_pipeline::17.989*25.96] mapped 28247 sequences

2023-01-14 22:54:53,443 - INFO - [M::main] Version: 2.24-r1122

2023-01-14 22:54:53,443 - INFO - [M::main] CMD: minimap2 -ax map-ont -t 40 4-LPC100_plassembler/chromosome.fasta 4-LPC100_plassembler/filtered_long_reads.fastq.gz

2023-01-14 22:54:53,443 - INFO - [M::main] Real time: 18.004 sec; CPU: 466.944 sec; Peak RSS: 1.955 GB

2023-01-14 22:54:53,547 - INFO - Mapping Short Reads to Putative Plasmid Contigs

2023-01-14 22:54:53,549 - INFO - [M::bwa_idx_load_from_disk] read 0 ALT contigs

2023-01-14 22:54:54,740 - INFO - [M::process] read 971478 sequences (202785156 bp)...

2023-01-14 22:55:00,134 - INFO - [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (24, 32818, 20, 34)

2023-01-14 22:55:00,134 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation FF...

2023-01-14 22:55:00,134 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (1736, 2854, 7435)

2023-01-14 22:55:00,134 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 18833)

2023-01-14 22:55:00,135 - INFO - [M::mem_pestat] mean and std.dev: (3654.58, 2785.73)

2023-01-14 22:55:00,135 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 24532)

2023-01-14 22:55:00,135 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation FR...

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (147, 225, 353)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 765)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] mean and std.dev: (261.55, 152.01)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 971)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation RF...

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (2315, 4318, 8360)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 20450)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] mean and std.dev: (4504.20, 3212.77)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 26495)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation RR...

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (2714, 3258, 5732)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 11768)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] mean and std.dev: (3952.65, 2152.64)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 14786)

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] skip orientation FF

2023-01-14 22:55:00,136 - INFO - [M::mem_pestat] skip orientation RF

2023-01-14 22:55:00,137 - INFO - [M::mem_pestat] skip orientation RR

2023-01-14 22:55:00,761 - INFO - [M::mem_process_seqs] Processed 971478 reads in 224.370 CPU sec, 6.021 real sec

2023-01-14 22:55:02,356 - INFO - [main] Version: 0.7.17-r1188

2023-01-14 22:55:02,356 - INFO - [main] CMD: bwa mem -t 40 4-LPC100_plassembler/non_chromosome.fasta 4-LPC100_plassembler/trimmed_R1.fastq 4-LPC100_plassembler/trimmed_R2.fastq

2023-01-14 22:55:02,357 - INFO - [main] Real time: 8.807 sec; CPU: 226.545 sec

2023-01-14 22:55:02,428 - INFO - Mapping Short Reads to Chromosome Contig

2023-01-14 22:55:02,432 - INFO - [M::bwa_idx_load_from_disk] read 0 ALT contigs

2023-01-14 22:55:03,633 - INFO - [M::process] read 971478 sequences (202785156 bp)...

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (81, 370991, 73, 75)

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation FF...

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (706, 2055, 6071)

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 16801)

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] mean and std.dev: (3331.37, 2969.51)

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 22166)

2023-01-14 22:55:09,273 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation FR...

2023-01-14 22:55:09,289 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (154, 231, 363)

2023-01-14 22:55:09,289 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 781)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] mean and std.dev: (270.88, 154.43)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 990)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation RF...

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (977, 4053, 6587)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 17807)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] mean and std.dev: (4142.62, 3190.58)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 23417)

2023-01-14 22:55:09,290 - INFO - [M::mem_pestat] analyzing insert size distribution for orientation RR...

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] (25, 50, 75) percentile: (1898, 4776, 7990)

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 20174)

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] mean and std.dev: (4663.32, 3244.67)

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] low and high boundaries for proper pairs: (1, 26266)

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] skip orientation FF

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] skip orientation RF

2023-01-14 22:55:09,291 - INFO - [M::mem_pestat] skip orientation RR

2023-01-14 22:55:10,613 - INFO - [M::mem_process_seqs] Processed 971478 reads in 262.535 CPU sec, 6.980 real sec

2023-01-14 22:55:12,467 - INFO - [main] Version: 0.7.17-r1188

2023-01-14 22:55:12,468 - INFO - [main] CMD: bwa mem -t 40 4-LPC100_plassembler/chromosome.fasta 4-LPC100_plassembler/trimmed_R1.fastq 4-LPC100_plassembler/trimmed_R2.fastq

2023-01-14 22:55:12,468 - INFO - [main] Real time: 10.038 sec; CPU: 264.910 sec

2023-01-14 22:55:12,542 - INFO - Processing Bams.

I have tried it on my recent projects, in which small plasmids were identified using various assemblers, and plassembler stops at this stage every time.

Any hints?

Bests, Jan

gbouras13 commented 1 year ago

Hi Jan,

Can you check whether samtools is installed? Running samtools --help is enough.

If it isn’t installed, please install it with conda install samtools (if you are using conda).

I’m pretty sure this is the issue. I have just checked the bioconda recipe and realised that I forgot to add samtools, which would explain this error if you installed via bioconda. Thanks for raising this issue - it should work fine once samtools is installed.

I will fix the bioconda recipe now for future versions (v0.1.5). And thanks for trying out plassembler!
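
For anyone hitting the same error, the check above can be run for several tools at once. This is only a minimal sketch, not plassembler's own dependency check: the function name find_missing_tools and the REQUIRED_TOOLS list are made up for illustration, and the list is simply the set of programs that appear in the log in this thread.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumption: executables named after the programs seen in the log above.
# This is not an official list of plassembler dependencies.
REQUIRED_TOOLS = ["flye", "unicycler", "fastp", "minimap2", "bwa", "samtools"]

if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the same environment that plassembler runs in, since a tool installed in a different conda environment will not be on that environment's PATH.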

George

gaworj commented 1 year ago

Hi, George,

Thank you! I followed your suggestion and checked whether samtools was installed. It was not. After installing samtools in the plassembler env, everything works fine.

Bests, Jan

gbouras13 commented 1 year ago

No problem @gaworj, thanks again for raising the issue - the issue you encountered should be fixed automatically in v0.1.5 (awaiting approval for bioconda, available from GitHub already). Another thing I added was --kmer_mode intended for high quality Nanopore reads (R10.4) without short reads as a bit of an experiment, so feel free to try that if you have such data (I don't yet!).

George

gaworj commented 1 year ago

Sounds great!

Can you also add nano-raw and nano-hq options for Flye input? This would help people who are using older ONT datasets. Another useful option would be the ability to analyze (copy number + PLSDB search) user-provided plasmid sequences that are already assembled.

Jan

gbouras13 commented 1 year ago

Hi Jan,

I have added the functionality you suggested in the 0.2.0 branch if you want to try it - I'm still doing some tests before I merge it into the main branch. Great idea!

It takes an -a flag to activate what I have called "assembled mode" and an -i input FASTA file. The file must contain the chromosome and plasmids. The chromosome contig header needs to be named "chromosome". Then it calculates the depth and runs PLSDB. It is compatible with long-only or both long and short read input FASTQs.
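
A quick way to check an input FASTA against the naming requirement before running is sketched below. It is an illustration under assumptions, not code from plassembler: the helper names are hypothetical, and it assumes the first whitespace-delimited token of the header is what must equal "chromosome" (the exact matching rule is not stated here).

```python
def fasta_headers(lines):
    """Return the first whitespace-delimited token of every FASTA header line."""
    headers = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            headers.append(fields[0] if fields else "")
    return headers


def has_chromosome_contig(lines, name="chromosome"):
    """True if at least one contig header matches `name` (assumed rule)."""
    return name in fasta_headers(lines)


# Placeholder records for illustration only; the sequences are not real data.
example = [">chromosome", "ACGT", ">plasmid_1", "GGCC"]
print(has_chromosome_contig(example))
```

For a real file, pass an open file handle instead of the list, for example has_chromosome_contig(open("input.fasta")).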

Also, by default plassembler runs Flye with nano-hq - if you want nano-raw, use the -r flag.

George

gbouras13 commented 1 year ago

@gaworj this is now properly available in v1.0.0, which has been updated with many other changes too. You can calculate copy number based on long and/or short reads if you specify -a, along with an existing chromosome assembly (--input_chromosome) and plasmids (--input_plasmids).
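
As a rough illustration of what a depth-based copy number means, the sketch below assumes it is estimated as the ratio of a plasmid's mean read depth to the chromosome's mean read depth. That formula is an assumption made for this example rather than a statement of the exact calculation in v1.0.0, and the function name and depth values are hypothetical.

```python
def copy_number(plasmid_depth, chromosome_depth):
    """Estimate copy number as plasmid mean depth / chromosome mean depth.

    Assumed formula for illustration; see the plassembler documentation
    for the calculation it actually performs.
    """
    if chromosome_depth <= 0:
        raise ValueError("chromosome depth must be positive")
    return plasmid_depth / chromosome_depth


# Hypothetical mean depths, not taken from the run in this thread.
print(copy_number(120.0, 40.0))  # 3.0
```

With both read types, the same ratio can be computed once from the long-read mappings and once from the short-read mappings, giving two separate estimates per plasmid.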

Closing this issue now - but let me know what you think if you give it a go.

George