Closed olgabot closed 4 years ago
Also to index:
sambamba)
✘ Mon 22 Jun - 08:12 ~/data_sm/tabula-microcebus/analyses/de-novo-orthology/uTAR
olga@tesla sambamba index --nthreads 120 --show-progress combined_bam_all_2.sambamba_remove_duplicates.bam
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)
[==============================================================================]
@olgabot so will we also need a sambamba step to index after removing duplicates? What is the sambamba index doing exactly/ why do we need to index? Is it just for performance reasons? Or is it that downstream tools require an index file?
additionally do we want to have a parameter that flags whether we want to do this step? Or should we just do this step automatically with every run?
@olgabot so will we also need a sambamba step to index after removing duplicates? What is the sambamba index doing exactly/ why do we need to index? Is it just for performance reasons? Or is it that downstream tools require an index file?
The second - downstream tools require an index (.bai
) file. E.g. the command samtools view
in the process samtools_view_fastq
requires an index for random access of a genomic region
additionally do we want to have a parameter that flags whether we want to do this step? Or should we just do this step automatically with every run?
Hmm.. I'm of the opinion that it should happen for every bam
, since duplicate reads do nothing for me in this pipeline, except add time to the compute job. But we should probably have a flag like --skip_remove_duplicates_bam
to skip it :)
Re what is the index anyway, here's the explanation from samtools:
Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access. Note for SAM this only works if the file has been BGZF compressed first.
This index is needed when region arguments are used to limit samtools view and similar commands to particular regions of interest.
Basically, if you want to jump to any genomic coordinates, then an Index/.bai
file is required. Eg. samtools view alignment.bam chr1:100-200
would raise an error if an alignment.bam.bai
file did not exist
Hope that helps!
thanks for the explanation @olgabot! I like the --skip_remove_duplicates_bam
flag for skipping, I'll add that in!
@olgabot so thinking about this a bit more i wanted to iron out the following usecases, let me know if i'm on the right track:
if (params.bam && params.bai && params.skip_remove_duplicates_bam)
-> skip sambamba
if (params.bam && params.bai && !params.skip_remove_duplicates_bam)
-> do sambamba step first, also create new index file with sambamba index
(or should we skip sambamba index
since we already have an index file? )
if (params.bam && params.skip_remove_duplicates_bam)
-> exit, no bai index file supplied
if (params.bam && !params.bai && !params.skip_remove_duplicates_bam)
-> throw warning no bai file supplied, but doesn't really matter since we'll be creating a new index file anyways
let me know if there's any use cases i missed as well!
if (params.bam && params.bai && params.skip_remove_duplicates_bam)
-> skip sambamba
Yep!
if (params.bam && params.bai && !params.skip_remove_duplicates_bam)
-> do sambamba step first, also create new index file withsambamba index
(or should we skipsambamba index
since we already have an index file? )
Yep, we'll need to create a new index file because the index is basically "to see chr3
, jump to line 1234". And since we've removed duplicate reads from the file, then all the line numbers will have changed
if (params.bam && params.skip_remove_duplicates_bam)
-> exit, no bai index file supplied
This seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no bai
then create the index? What do you think?
if (params.bam && !params.bai && !params.skip_remove_duplicates_bam)
-> throw warning no bai file supplied, but doesn't really matter since we'll be creating a new index file anyways
Yep, that sounds right!
if (params.bam && params.skip_remove_duplicates_bam)
-> exit, no bai index file suppliedThis seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no
bai
then create the index? What do you think?
i think we should exit since the user had the flag to --skip-duplicates-bam
? I guess it depends, if we proceed with removing duplicates and creating a new index file could that lead to results that are misleading?
if (params.bam && params.skip_remove_duplicates_bam)
-> exit, no bai index file suppliedThis seems right but also somewhat unfair since post-duplicate removal bam gets a brand new index file anyway... maybe if there's no
bai
then create the index? What do you think?i think we should exit since the user had the flag to
--skip-duplicates-bam
? I guess it depends, if we proceed with removing duplicates and creating a new index file could that lead to results that are misleading?
The results wouldn't necessarily be misleading (at least in my opinion), but they wouldn't be accurate to what the user wanted. So yeah, let's exit. Thanks!
we should still ask users to supply a bed file if they want to remove duplicates right? Once we do the sambamba
process we still would want to direct the output channel to the samtools_view_fastq
process?
I think there's a couple things:
The second thing then allows the user to do sencha translate
on ALL the reads, rather than on a per-region basis. So no, supplying a bed file is not required for removing duplicates. Does that help?
This pipeline can be very slow for large bam files, which may contain many duplicate reads, reads that align to exactly the same position in the genome and thus likely have the same sequence.
Sambamba is a single-command and multithreaded way to remove duplicate reads from a bam file. Sambamba markdup documentation It is faster than
samtools
orPicard
for removing duplicates, as it is both multithreaded (whichPicard
is not) and happens in a single command (samtools markdup
requires several steps to finally remove duplicate reads)Example command:
Also use
--sort-buffer-size
to set memory limits: