czbiohub-sf / nf-predictorthologs

*de novo* orthologous gene predictions from bam + bed or fasta/fastq data
MIT License
4 stars 2 forks source link

Use sambamba to remove duplicates of bam files #57

Closed olgabot closed 4 years ago

olgabot commented 4 years ago

This pipeline can be very slow for large bam files, which may contain many duplicate reads, reads that align to exactly the same position in the genome and thus likely have the same sequence.

Sambamba is a single-command and multithreaded way to remove duplicate reads from a bam file. Sambamba markdup documentation It is faster than samtools or Picard for removing duplicates, as it is both multithreaded (which Picard is not) and happens in a single command (samtools markdup requires several steps to finally remove duplicate reads)

Example command:

(sambamba)
 ✘  Wed 17 Jun - 08:31  ~/data_sm/tabula-microcebus/analyses/de-novo-orthology/uTAR 
 olga@tesla  time sambamba markdup --show-progress --hash-table-size=999999999 --remove-duplicates --nthreads 96 combined_bam_all_2.sorted.bam combined_bam_all_2.sambamba_remove_duplicates.bam

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
[==============================================================================]
  sorted 0 end pairs
     and 18896280915 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 27042899 ms
  found 18482761503 duplicates
collected list of positions in 2127 min 31 sec
removing duplicates...
[==============================================================================]
collected list of positions in 3412 min 10 sec
sambamba markdup --show-progress --hash-table-size=999999999  --nthreads 96    716187.58s user 1029309.94s system 852% cpu 56:52:18.32 tota

Also use --sort-buffer-size to set memory limits:

         --sort-buffer-size=SORT_BUFFER_SIZE
                    total amount of memory (in *megabytes*) used for sorting purposes;
                    the default is 2048, increasing it will reduce the number of created
                    temporary files and the time spent in the main thread
olgabot commented 4 years ago

Also to index:

sambamba)
 ✘  Mon 22 Jun - 08:12  ~/data_sm/tabula-microcebus/analyses/de-novo-orthology/uTAR 
 olga@tesla  sambamba index --nthreads 120 --show-progress combined_bam_all_2.sambamba_remove_duplicates.bam

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

[==============================================================================]
phoenixAja commented 4 years ago

@olgabot so will we also need a sambamba step to index after removing duplicates? What is the sambamba index doing exactly/ why do we need to index? Is it just for performance reasons? Or is it that downstream tools require an index file?

phoenixAja commented 4 years ago

additionally do we want to have a parameter that flags whether we want to do this step? Or should we just do this step automatically with every run?

olgabot commented 4 years ago

@olgabot so will we also need a sambamba step to index after removing duplicates? What is the sambamba index doing exactly/ why do we need to index? Is it just for performance reasons? Or is it that downstream tools require an index file?

The second - downstream tools require an index (.bai) file. E.g. the command samtools view in the process samtools_view_fastq requires an index for random access of a genomic region

additionally do we want to have a parameter that flags whether we want to do this step? Or should we just do this step automatically with every run?

Hmm.. I'm of the opinion that it should happen for every bam, since duplicate reads do nothing for me in this pipeline, except add time to the compute job. But we should probably have a flag like --skip_remove_duplicates_bam to skip it :)

olgabot commented 4 years ago

Re what is the index anyway, here's the explanation from samtools:

Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access. Note for SAM this only works if the file has been BGZF compressed first.

This index is needed when region arguments are used to limit samtools view and similar commands to particular regions of interest.

Basically, if you want to jump to any genomic coordinates, then an Index/.bai file is required. Eg. samtools view alignment.bam chr1:100-200 would raise an error if an alignment.bam.bai file did not exist

Hope that helps!

phoenixAja commented 4 years ago

thanks for the explanation @olgabot! I like the --skip_remove_duplicates_bam flag for skipping, I'll add that in!

phoenixAja commented 4 years ago

@olgabot so thinking about this a bit more i wanted to iron out the following usecases, let me know if i'm on the right track:

if (params.bam && params.bai && params.skip_remove_duplicates_bam) -> skip sambamba if (params.bam && params.bai && !params.skip_remove_duplicates_bam) -> do sambamba step first, also create new index file with sambamba index (or should we skip sambamba index since we already have an index file? ) if (params.bam && params.skip_remove_duplicates_bam) -> exit, no bai index file supplied if (params.bam && !params.bai && !params.skip_remove_duplicates_bam) -> throw warning no bai file supplied, but doesn't really matter since we'll be creating a new index file anyways

let me know if there's any use cases i missed as well!

olgabot commented 4 years ago

if (params.bam && params.bai && params.skip_remove_duplicates_bam) -> skip sambamba

Yep!

if (params.bam && params.bai && !params.skip_remove_duplicates_bam) -> do sambamba step first, also create new index file with sambamba index (or should we skip sambamba index since we already have an index file? )

Yep, we'll need to create a new index file because the index is basically "to see chr3, jump to line 1234". And since we've removed duplicate reads from the file, then all the line numbers will have changed

if (params.bam && params.skip_remove_duplicates_bam) -> exit, no bai index file supplied

This seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no bai then create the index? What do you think?

if (params.bam && !params.bai && !params.skip_remove_duplicates_bam) -> throw warning no bai file supplied, but doesn't really matter since we'll be creating a new index file anyways

Yep, that sounds right!

phoenixAja commented 4 years ago

if (params.bam && params.skip_remove_duplicates_bam) -> exit, no bai index file supplied

This seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no bai then create the index? What do you think?

i think we should exit since the user had the flag to --skip-duplicates-bam? I guess it depends, if we proceed with removing duplicates and creating a new index file could that lead to results that are misleading?

olgabot commented 4 years ago

if (params.bam && params.skip_remove_duplicates_bam) -> exit, no bai index file supplied

This seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no bai then create the index? What do you think?

i think we should exit since the user had the flag to --skip-duplicates-bam? I guess it depends, if we proceed with removing duplicates and creating a new index file could that lead to results that are misleading?

The results wouldn't necessarily be misleading (at least in my opinion), but they wouldn't be accurate to what the user wanted. So yeah, let's exit. Thanks!

phoenixAja commented 4 years ago

we should still ask users to supply a bed file if they want to remove duplicates right? Once we do the sambamba process we still would want to direct the output channel to the samtools_view_fastq process?

olgabot commented 4 years ago

I think there's a couple things:

  1. Add option to remove duplicates with sambamba
  2. Add option to not supply a bed file

The second thing then allows the user to do sencha translate on ALL the reads, rather than on a per-region basis. So no, supplying a bed file is not required for removing duplicates. Does that help?