DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
187 stars 44 forks source link

trouble with bamfilter #57

Closed guyleonard closed 5 years ago

guyleonard commented 7 years ago

Hi I have two issues with running the bamfilter program.

Firstly, I have an -i include list called scaffolds_vs_nt_1e-10_eukaryota_test.txt:

NODE_10002_length_2789_cov_4.69971
NODE_10005_length_2789_cov_41.3277
NODE_1000_length_17179_cov_4.50444
NODE_10012_length_2785_cov_44.919

I also have a sorted+indexed bam file that I created by indexing the scaffolds.fasta with bwa and then mapping my individual libraries with bwa mem, then merging those bams to one final bam with samtools. All v1.3. Then followed with samtools sort and index.

When I try the code below, it works fine.

samtools view -b scaffolds_mapped_all_reads_sorted.bam NODE_10002_length_2789_cov_4.69971 > test.bam

I get the reads mapped to that scaffolds, but when I try:

blobtools bamfilter -b scaffolds_mapped_all_reads_sorted.bam -i scaffolds_vs_nt_1e-10_eukaryota_test.txt --threads 16
[+] Reading ../MAPPING/scaffolds_mapped_all_reads_sorted.bam
[+] Filtering ../MAPPING/scaffolds_mapped_all_reads_sorted.bam ...
[+] Filtered InIn (pairs=0) ...
[+] Filtered InUn (pairs=0) ...
[+] Filtered ExIn (pairs=0) ...
Traceback (most recent call last):
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/bamfilter.py", line 64, in <module>
    main()
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/bamfilter.py", line 56, in main
    BtIO.parseBamForFilter(bam_f, include_unmapped, out_f, sequence_list, None, gzip, do_sort, keep_sorted, sort_threads)
  File "/home/cs02gl/single_cell_workflow/install_dependencies/build/blobtools/lib/BtIO.py", line 370, in parseBamForFilter
    info_string.append((read_pair_type + ' pairs', "{:,}".format(count), '{0:.1%}'.format(count / int(seen_reads / 2))))
ZeroDivisionError: division by zero

Well you can see the error message - none of the scaffold IDs match anything in the BAM - which can't be the case as the samtools command above works fine. Is there something obvious I have missed?

Secondly, almost incidentally - and this may be because blobtools uses samtools v1.5 (though there doesn't seem to be anything in the release notes to suggest this behaviour) but if I repeat the same command above but with "--sort" on my unsorted BAM file, instead of the 5.5GB file I expect it comes out as 13GB and then working with that file is impossible.

samtools view -b scaffolds_mapped_all_reads.bam.readsorted.bam NODE_10002_length_2789_cov_4.69971 > testing.bam
[main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

samtools index scaffolds_mapped_all_reads.bam.readsorted.bam 
[E::hts_idx_push] NO_COOR reads not in a single block at the end 31 -1
samtools index: "../MAPPING/scaffolds_mapped_all_reads.bam.readsorted.bam" is corrupted or unsorted

Any help is much appreciated! :)

DRL commented 7 years ago

Hi guyleonard,

Thanks for this. Have been busy these days, will look into it tomorrow.

cheers,

dom

faithman commented 6 years ago

Hi, I have the same problem, did you guys have a solution for this?

Thanks! Ye

DRL commented 6 years ago

Hi both,

is the BAM file sorted by readname (this is usually how it comes out of an aligner)? Because pairs have to be together in the BAM file ... a position sorted BAM file would cause an error...

let me know if that helps,

cheers,

dom

faithman commented 6 years ago

Hi @DRL ,

My bam file was sorted by read name, and pairs were gathered.

Thanks!

AntoineHo commented 5 years ago

Hello (and thanks for blobtools :) ), I want to extract all reads mapping to some scaffolds, thus I grepped all scaffolds name and ran the following command on my .bam file:

blobtools bamfilter --threads 14 -i contigs.list -u -b rawVpe.sorted.bam

My bam file is around 42 Gb but when I run the command, the threads go to dorment status, it runs overnight with 450 Gb RAM usage (still increasing) and it outputs a 0 bytes InIn.fq My stdout looks like this frozen at 99% for hours:

[+] Reading RAWvsPE.sorted.bam
[+] Filtering RAWvsPE.sorted.bam ...
[%]     99%

I guess this is not expected behaviour ? Any ideas of what goes wrong here ?