Update duplicate marking and handling

pontushojer commented 4 years ago

Currently duplicate reads are kept in for input to downstream analysis (variant calling etc.). These should be removed and I purpose we do that in the filtering step.

FrickTobias commented 4 years ago

I am not sure these changes does exactly what we want.

If there are two reads with different barcodes at one position they will be marked as duplicates. However if there is no other proximal occurrence of this, their barcodes will not be merged and both reads should be kept.

marcelm commented 4 years ago

Not sure how relevant this is here, but I wanted to mention a discussion I had in another project. In that project, the reads have UMIs and we need to use UMI-aware duplicate marking programs. There are at least

Picard UmiAwareMarkDuplicatesWithMateCigar,
je MarkDupes,
and possibly UMI-tools.

Also, note that downstream tools (in particular variant callers) will usually ignore reads marked as duplicates anyway, so whether you remove them in the filterclusters script or not doesn’t matter too much. However, this also means that the duplicate marking needs to be done correctly, and perhaps differently than now.

pontushojer commented 4 years ago

If there are two reads with different barcodes at one position they will be marked as duplicates. However if there is no other proximal occurrence of this, their barcodes will not be merged and both reads should be kept.

@FrickTobias This is true. I only considered the case where barcodes were actually merged in which case duplicates can be remove directly. In your case however they must be reclassified. The only way to do that however, what I know of, is to introduce a second duplicate calling step (using perhaps Marcels suggested tools) that is barcode-aware. We had this earlier using picard MarkDuplicates but removed it due to long runtimes as I recall.

One thing more i noticed is that the duplicate calling prior to the clusterrmdup step might not be as useful as we though. I did a preliminary test on a non-duplcliate-marked BAM and saw no difference in cluster merges compared to the duplicate-marked BAM input. Perhaps we could remove this prior step and instead only incorporate the barcode-aware one.

Also, note that downstream tools (in particular variant callers) will usually ignore reads marked as duplicates anyway, so whether you remove them in the filterclusters script or not doesn’t matter too much. However, this also means that the duplicate marking needs to be done correctly, and perhaps differently than now.

@marcelm I did not know that this was the case. But as the same time you state "usually ignore reads marked as duplicates" ;) This makes me think that either we have to check each tool to see if they do so or remove duplicates to be on the safe side.

pontushojer commented 4 years ago

Here is the new version which I arrived at after discussion with @FrickTobias.

Changes:

Removed barcode-unaware duplicate marking prior to clusterrmdup. Also removed sambamba and samblaster.
Modified clusterrmdup script to not confirm that duplicates positions are infact duplicate using the marks from the prior duplicate-marking step. I aslo compared the output files with and without this modification and could see no difference. I tested both on the small testfile in blr-testdata-0.3 and a larger chr22 dataset.
Added barcode-aware duplicate marking after clusterrmdup step using picard MarkDuplicates. Of the tools mentioned by @marcelm this was the only one that could be directly integrated into our pipeline. je MarkDupes and UMI-tools would require so modification to the barcode-tagging structure.

Here is a filegraph of the current pipeline:

marcelm commented 4 years ago

@marcelm I did not know that this was the case. But as the same time you state "usually ignore reads marked as duplicates" ;) This makes me think that either we have to check each tool to see if they do so or remove duplicates to be on the safe side.

I haven’t checked each tool myself, which is why I said "usually". But it would be strange if a variant caller did not ignore duplicates.

Perhaps I should explain why the duplicate markers don’t remove duplicates by default. One of the design aims of the BAM/CRAM format, as far as I understand, was that you can recover all the raw reads from it. So after you’ve done all the mapping, duplicate marking and other BAM manipulations, you would keep a single BAM/CRAM file and throw away the original FASTQ file. If necessary, you could convert losslessly back to FASTQ and re-run your pipeline, but in the meantime, you save a lot of space.

Probably few pipelines actually do this, but this explains a couple of properties of the BAM format:

There are flags to mark reads as PCR/optical duplicates or as not passing quality control.
Alignments can be soft-clipped, allowing you to recover the original read.
Unmapped reads are left in the file.

pontushojer commented 4 years ago

@marcelm If it one of the aims behind the format it is probably a good idea for us to follow it, as long as it is resonable to do so. I know that me and @FrickTobias have talked about following this, but I did not know how widely this design idea was applied in general. In this case I will revert to not removing duplicates.

FrickTobias commented 4 years ago

To me this looks good! Awesome work and interesting issue, especially since we never ran into it before.

marcelm commented 4 years ago

Sorry I wasn’t entirely clear: I only wanted to illustrate why every self-respecting variant caller will definitely ignore duplicates (because that’s how BAM was meant to be used).

For the BLR pipeline in particular, I also think that there should be only one BAM in the end, and the intermediate ones should be removed. However, there is so much manipulation of reads and read names going on that it’s probably not realistic one would ever throw away the input FASTQ, so it’s actually fine to remove duplicates.

pontushojer commented 4 years ago

Talk to Tobias and we agreed on merging now.

AfshinLab / BLR

Update duplicate marking and handling #1