Closed: pontushojer closed this issue 4 years ago
I am not sure these changes do exactly what we want.
If there are two reads with different barcodes at one position, they will be marked as duplicates. However, if there is no other proximal occurrence of this, their barcodes will not be merged, and both reads should be kept.
Not sure how relevant this is here, but I wanted to mention a discussion I had in another project. In that project, the reads have UMIs and we need to use UMI-aware duplicate marking programs. There are at least a few such tools available.
Also, note that downstream tools (in particular variant callers) will usually ignore reads marked as duplicates anyway, so whether you remove them in the filterclusters script or not doesn't matter too much. However, this also means that the duplicate marking needs to be done correctly, and perhaps differently than now.
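As an aside (not from the thread itself): the duplicate mark is just bit 0x400 in the SAM FLAG field, which is why downstream tools can skip such reads with a single bit test. A minimal Python sketch of that check, using toy records rather than a real BAM parser:

```python
# Minimal sketch of how a downstream tool skips duplicate-marked reads.
# The SAM specification defines FLAG bit 0x400 as "PCR or optical duplicate".
DUP_FLAG = 0x400

def is_duplicate(flag: int) -> bool:
    """Return True if the SAM FLAG marks the read as a duplicate."""
    return bool(flag & DUP_FLAG)

# Toy records: (read_name, FLAG) pairs; 1024 == 0x400 set.
records = [("read1", 0), ("read2", 1024), ("read3", 16)]

# A variant caller would effectively do this filtering internally.
kept = [name for name, flag in records if not is_duplicate(flag)]
print(kept)  # → ['read1', 'read3']
```

The record tuples here are hypothetical stand-ins; a real tool would read the FLAG from the second column of each SAM/BAM alignment record.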
> If there are two reads with different barcodes at one position they will be marked as duplicates. However if there is no other proximal occurrence of this, their barcodes will not be merged and both reads should be kept.
@FrickTobias This is true. I only considered the case where barcodes were actually merged, in which case duplicates can be removed directly. In your case, however, they must be reclassified. The only way to do that, as far as I know, is to introduce a second, barcode-aware duplicate-calling step (perhaps using one of the tools Marcel suggested). We had this earlier using picard MarkDuplicates but removed it due to long runtimes, as I recall.
One more thing I noticed is that the duplicate calling prior to the clusterrmdup step might not be as useful as we thought. I did a preliminary test on a non-duplicate-marked BAM and saw no difference in cluster merges compared to the duplicate-marked BAM input. Perhaps we could remove this prior step and instead only incorporate the barcode-aware one.
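To make the barcode issue concrete, here is a hypothetical sketch (not the actual clusterrmdup code) contrasting position-only duplicate grouping with barcode-aware grouping. With position-only grouping, a read at the same coordinate but with a different barcode is wrongly marked; barcode-aware grouping keeps it:

```python
# Hypothetical illustration: position-only vs barcode-aware duplicate marking.
from collections import defaultdict

# Toy reads: (name, chrom, pos, barcode). r2 shares a position with r1
# but carries a different barcode; r3 is a true duplicate of r1.
reads = [
    ("r1", "chr1", 100, "BC_A"),
    ("r2", "chr1", 100, "BC_B"),
    ("r3", "chr1", 100, "BC_A"),
]

def mark_duplicates(reads, barcode_aware: bool):
    """Mark every read after the first in each group as a duplicate."""
    groups = defaultdict(list)
    for name, chrom, pos, bc in reads:
        key = (chrom, pos, bc) if barcode_aware else (chrom, pos)
        groups[key].append(name)
    dups = set()
    for members in groups.values():
        dups.update(members[1:])  # keep one representative per group
    return dups

print(sorted(mark_duplicates(reads, barcode_aware=False)))  # → ['r2', 'r3']
print(sorted(mark_duplicates(reads, barcode_aware=True)))   # → ['r3']
```

The read names and barcode values are made up; real barcode-aware tools key on more fields (strand, mate position, etc.), but the grouping idea is the same.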
> Also, note that downstream tools (in particular variant callers) will usually ignore reads marked as duplicates anyway, so whether you remove them in the filterclusters script or not doesn’t matter too much. However, this also means that the duplicate marking needs to be done correctly, and perhaps differently than now.
@marcelm I did not know that this was the case. But at the same time, you state "usually ignore reads marked as duplicates" ;) This makes me think that we either have to check each tool to see if it does so, or remove duplicates to be on the safe side.
Here is the new version which I arrived at after discussion with @FrickTobias.
Changes:
- Removed the duplicate-marking step prior to clusterrmdup. Also removed sambamba and samblaster.
- Modified the clusterrmdup script to not confirm that duplicate positions are in fact duplicates using the marks from the prior duplicate-marking step. I also compared the output files with and without this modification and could see no difference. I tested both on the small test file in blr-testdata-0.3 and on a larger chr22 dataset.
- Added a barcode-aware duplicate-marking step after the clusterrmdup step using picard MarkDuplicates. Of the tools mentioned by @marcelm, this was the only one that could be directly integrated into our pipeline; je MarkDupes and UMI-tools would require some modification to the barcode-tagging structure.

Here is a filegraph of the current pipeline:
> @marcelm I did not know that this was the case. But at the same time you state "usually ignore reads marked as duplicates" ;) This makes me think that either we have to check each tool to see if they do so or remove duplicates to be on the safe side.
I haven’t checked each tool myself, which is why I said "usually". But it would be strange if a variant caller did not ignore duplicates.
Perhaps I should explain why the duplicate markers don’t remove duplicates by default. One of the design aims of the BAM/CRAM format, as far as I understand, was that you can recover all the raw reads from it. So after you’ve done all the mapping, duplicate marking and other BAM manipulations, you would keep a single BAM/CRAM file and throw away the original FASTQ file. If necessary, you could convert losslessly back to FASTQ and re-run your pipeline, but in the meantime, you save a lot of space.
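The lossless-recovery idea can be illustrated with a toy sketch: because duplicates are only flagged rather than deleted, every original read is still in the BAM, and FASTQ can be regenerated from it. This simplified Python example (not real samtools code; it ignores details like reverse-complementing reverse-strand reads) shows that the duplicate-marked read survives the round trip:

```python
# Toy illustration of why marking (not removing) duplicates keeps BAM lossless:
# every read, duplicate or not, can still be written back out as FASTQ.
records = [
    # (name, FLAG, sequence, qualities); 1024 means duplicate-marked
    ("read1", 0, "ACGT", "IIII"),
    ("read2", 1024, "ACGT", "IIII"),  # duplicate of read1, but still stored
]

def to_fastq(records):
    """Regenerate FASTQ entries from all stored reads, ignoring the dup flag."""
    out = []
    for name, flag, seq, qual in records:
        out.append(f"@{name}\n{seq}\n+\n{qual}")
    return "\n".join(out)

fastq = to_fastq(records)
print(fastq)  # both reads are recovered, including the duplicate
```

Had the duplicate been removed instead of marked, read2 would be gone and the original FASTQ could not be reconstructed, which is exactly the trade-off discussed above.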
Probably few pipelines actually do this, but it explains a couple of properties of the BAM format (such as duplicates being marked rather than removed by default).
@marcelm If it is one of the aims behind the format, it is probably a good idea for us to follow it, as long as it is reasonable to do so. I know that @FrickTobias and I have talked about following this, but I did not know how widely this design idea was applied in general. In this case, I will revert to not removing duplicates.
To me this looks good! Awesome work and interesting issue, especially since we never ran into it before.
Sorry I wasn’t entirely clear: I only wanted to illustrate why every self-respecting variant caller will definitely ignore duplicates (because that’s how BAM was meant to be used).
For the BLR pipeline in particular, I also think that there should be only one BAM in the end, and the intermediate ones should be removed. However, there is so much manipulation of reads and read names going on that it’s probably not realistic one would ever throw away the input FASTQ, so it’s actually fine to remove duplicates.
Talked to Tobias and we agreed on merging now.
Currently, duplicate reads are kept in the input to downstream analysis (variant calling etc.). These should be removed, and I propose we do that in the filtering step.