Bismark silently outputs incorrect results when UMIs are added using Illuminas bcl-convert

lars-work-sund commented 2 months ago

This is not really a bug as the documentation clearly states how deduplicate_bismark expects UMIs to be handled, but it is an easy mistake to make. As documented in deduplicate_bismark, Bismark expects UMIs of the form: span>@</spanA00001:001:HN2F7DRX1:1:1101:1452:1000 1:N:0:AATGACGC:CAAGAG But if Illuminas bcl-convert is used with OverrideCycles to handle UMIs, the read ID looks like this span>@</spanA00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC The UMI is highlighted in bold. This means the sample index is used as a UMI, and no warning or error is emitted.

I propose running a pre-flight check to detect this scenario, and potentially to support the UMI location chosen by Illumina.

EDIT: I might have been completely off. I'll close it for now.

FelixKrueger commented 2 months ago

Thanks Lars, I'll await further developments. Hope all is well?

lars-work-sund commented 2 months ago

Hi Felix, I'm doing well, I hope you are too :)

I double checked and this is an issue. Tools like Illuminas bcl-convert and umi-tools places the UMI like this (umi-tools by default uses _ instead of : for separation): span>@</spanA00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC Normally bowtie2 (and other aligners it seems) drops everything after the space, so the corresponding sam record ID would be: A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG

The function fix_IDs https://github.com/FelixKrueger/Bismark/blob/37e2cad18621c2619a9e02d1a69fdfec1819ed23/bismark#L6207 replaces spaces with underscores so the sam record ID is A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG_1:N:0:AATGACGC This causes index to be treated as the UMI. No warning or error is given (either by deduplicate_bismark or umi-tools dedup) and the estimated number of duplicates is massively inflated.

This is extra problematic as this means the workflow

using umi-tools to extract UMIs
aligning with bismark
deduplicating with umi-tools

will also cause this problem.

Here is an example:	UMI placement method	% Unique reads
Manually placed at the end	77.33%
Added by bcl-convert	5.69%
Added by bcl-convert + magical `--icpc` flag	77.34%

FelixKrueger commented 1 month ago

Hi Lars,

sorry for the late reaction this issue. I can completely see how this issue arises, and that the behaviour you describe is not desirable. I was hoping to find a good suggestion of how to handle this so that users are aware of it...

Just to quickly mention that currently, deduplicate_bismark can't really be accused of misbehaving as the default deduplication behaviour doesn't take any UMI into account, so neither the CAAGAG nor AATGACGC in the above examples. The option --barcode states:

The barcode needs to be the last element of the read ID and separated by a ':'

I can currently think of a few different actions we could take:

First, we could add some extra warning text to the option --barcode for deduplicate_bismark mentioning the behaviour of bcl-convert
We could add a regex test when deduplicate_bismark is run in UMI-mode, to check for the occurrence a DNA-sequence, followed by its typical output of read number, pass filter and mismatches (?)? For :CAAGAG_1:N:0:AATGACGC maybe something like this: /:[CAGTN]+_\d:N:\d:[CAGTN]+$/, and either die or warn about this?
We could potentially also add another option that will use the first sequence (here CAAGAG) as UMI?
we could provide a post-processing script that move the UMI to the end of the read so that it works again?
I realise that none of these measures makes users of UMI-tools aware of a potential issue. Maybe we could add an extra FAQ question to https://felixkrueger.github.io/Bismark/faq/ to raise awareness?

What are your thoughts on this?

lars-work-sund commented 1 month ago

Hi Felix

I agree, deduplicate_bismark does exactly what it promises, it just happens that the expected behavior interacts poorly with how Illumina does things.

I like all the options you suggest, but I think options 2 and 3 are the most relevant. A warning/error that it looks like you are deduplicating based on index sequence would probably have saved me. I prefer an error, since warnings are more easily missed/ignored but I can also see how that could be annoying. Is there some way to check if the first N reads all have this pattern and an identical "UMI"? Then it surely must be an error? As for option 3, it is already implemented as far as I can see: the --icpc flag (from #236) skips the replacing of " " with an "_" in the read name and results in correct deduplication, you could promote that flag as a fix to this issue.

FelixKrueger commented 1 month ago

Thanks for the comments. For 2), yes we can check the first say 1000 lines and see if all 'UMIs' are the same (as in the sequencing index), and then die.

The --icpc flag would be an option, however it will require reprocessing. Still, adding a check for the UMI/index and adding additional documentation is probably most feasible.

lars-work-sund commented 1 month ago

Which part of using the --icpc flag requires reprocessing? In my tests the analysis works fine when just adding the --icpc flag, this is of course assuming the UMI is in the place where Illumina puts it.

FelixKrueger commented 1 month ago

Yea I meant if you had it processed already without --icpc, it would potentially be fixable by transferring the UMI during the deduplication step, or alternatively reprocessing with --icpc.

FelixKrueger / Bismark

Bismark silently outputs incorrect results when UMIs are added using Illuminas bcl-convert #699