CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

bam dedup (umi_tools dedup) #555

Closed HengkuanLi closed 6 months ago

HengkuanLi commented 2 years ago

umi_tools dedup --extract-umi-method=tag --umi-tag=UB --cell-tag=CB \ --method unique \ -I input.bam -S output.bam \ --log=LOGFILE

Hello, I used the above command to deduplicate the bam file, but there are still reads in the deduplicated bam file with the same coordinates and UMI

TomSmithCGAT commented 2 years ago

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

HengkuanLi commented 2 years ago

example 1: E100047948L1C007R01003932541 0 10 103122537 3 87M4D13M 0 0 GAATTTTGCGAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGAGACA NH:i:2 HI:i:1 AS:i:86 nM:i:1 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3 E100047948L1C010R02000347603 0 10 103122537 255 89M11S 0 0 GAATTTTGCAAAATGTTGTGACAGATATTTGTCAGAATTATTGGAAATATGTTATGGTGGATGGCATTTTTTCATATAAAATTATATTTCTGTTGTGTCA NH:i:1 HI:i:1 AS:i:87 nM:i:0 UR:Z:GTGCCCTCCA CB:Z:045E30A27A GN:Z:B2M UB:Z:GTGCCCTCCA RE:A:E DB:Z:CELL1_N3

example 2: E100047948L1C042R02203017007 0 11 74998101 3 5S95M 0 0 CTATTTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAA NH:i:2 HI:i:1 AS:i:93 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3 E100047948L1C023R03300929063 0 11 74998101 3 2S98M 0 0 ATTTTTTTTTTTTTTTAATGCTGAATATTGTGATTTCTTATTTGAGATTGAAAAAGACCCACTTGAAACTGATATATGTTAGAATATATTATGTTAATAA NH:i:2 HI:i:1 AS:i:96 nM:i:0 UR:Z:TCGATGCAAG CB:Z:7D43A87E2B GN:Z:SF3B6 UB:Z:TCGATGCAAG RE:A:E DB:Z:CELL1_N3

example 3: E100047948L1C039R02503252502 0 2 65569394 255 13S87M 0 0 TGGCCTCCGACTTTTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTTGATTAACTGTTGCCTGCT NH:i:1 HI:i:1 AS:i:81 nM:i:2 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3 E100047948L1C032R02101989739 0 2 65569394 255 100M 0 0 TTCCACCCTCCTGTGCCCCCCAATGGTTTTCTTTGCAAGTGCTTTTGGAACTAAGAAGCTAGTGTCTTGGATTAACTGATGCCTGCTAGTGCTTTCTGAT NH:i:1 HI:i:1 AS:i:98 nM:i:0 UR:Z:AAACGCCCAA CB:Z:9818DBD8C1 GN:Z:ACTR3 UB:Z:AAACGCCCAA RE:A:E DB:Z:CELL1_N3

HengkuanLi commented 2 years ago

Hi @zigenLi. Would you mind posting a few examples please. I suspect they may have different soft-clipping in the CIGAR string

Hello, the above are some examples I pasted. Could you please check the problem?

TomSmithCGAT commented 2 years ago

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

TomSmithCGAT commented 2 years ago

@IanSudbery - I think we could probably do with a page/FAQ entry on the readme giving examples why two reads may appear to be duplicates by eye, but not by UMI-tools. Took me a while to realise what was going on with example 1 above. Thoughts?

IanSudbery commented 2 years ago

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

HengkuanLi commented 2 years ago

In example 1, the difference is soft-clipping at the end of read2. By default, UMI-tools assumes that > 4 bases soft-clipped is indicative of a difference in splicing, which would also mean the two reads are not duplicates. You can turn this behaviour off with --soft-clipthreshold=100. See docs here

In example 2 & 3, the first read has more bases soft-clipped at the starte, e.g example2: 5S95M vs 2S98M. UMI-tools takes account of the soft-clipping to determine the alignment start position to avoid the situation where duplicate reads have different alignments because one has low quality base calls at the start of the read which are then soft-clipped by the aligner.

On an unrelated note, your cell barcodes look strange, e.g not nucleotide sequences. Is that what you want?

Yes, the barcode sequence here is encoded.

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

HengkuanLi commented 2 years ago

@zigenLi Just checking you are aware that while you have set a cell barcode, you've not told dedup to consider it during the deduplication? (--per-cell).

@TomSmithCGAT Probably wise.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

IanSudbery commented 2 years ago

I am analyzing single-cell data. I want to deduplicate the generated bam file. The starting coordinates of the reads of the same UMI are different. Should I keep this kind of reads?

It depends on the single cell protocol you have used. We always use some sort of positional information to deduplicate reads, but what information that is depends on how the reads were generated. If fragmentation of the cDNA happens before PCR (as in protocols like SMART-seq2), then one molecule will always give rise to reads with the same alignment position, and thus the position is informative as to whether a read is unique - reads with the same UMI, but different read alignment positions will have arisen from different cDNA molecules. In these cases we use UMI and (adjusted) alignment position to deduplicate reads.

However, if PCR happens before fragmentation (like in 10X), then one molecule can give rise to reads with multiple different alignment positions and thus precise alignment position is not informative - reads with the same UMI, but different read alignment positions may have arisen from the same or different cDNA molecules. However, the alignment positions for a molecule must always be in the same transcript. Thus in these cases we use UMI and transcript/gene assignment to deduplicate reads. This mode is activated using --per-gene and requires assigning reads to genes before running dedup.

Is the parameter "--per-cell" necessary when running the above command? What would be the impact without this parameter?

--per-cell tells UMI-tools to use the cell barcode when deduplicating reads as well as the UMI (and which ever positional information is specified as above). This means that two reads with the same UMI and the same position (or the same transcript/gene assignment), but coming from different cells, will be treated as independent of each other. This is generally desirable in a single-cell context.

HengkuanLi commented 2 years ago

@TomSmithCGAT @IanSudbery Thank you very much for your help, I completed the deduplication of BAM file by adding the parameter "--per-gene".

But without parameter "--per-gene", adding parameter "--soft-clipthreshold=100" still exists reads with the same coordinates and UMI in the deduplicated bam file.