CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

one read to multiple unique_id? #493

Closed YichaoOU closed 1 year ago

YichaoOU commented 3 years ago

This read M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT is being assigned to multiple UMI groups?

command I used umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired --per-cell > test.bam

read_id contig  position        gene    umi     umi_count       final_umi       final_umi_count unique_id
M04990:85:000000000-G8PPT:1:2102:15075:12691_GATAGAGGTAAGCGTTTAGCTTGT_GTAAAGTACC        chr1    22077   NA      GTAAAGTACC      1       GTAAAGTACC      1       0
M04990:85:000000000-G8PPT:1:1101:7449:14403_GTAAGGTGGGTCGTGTGCAATCCG_GATTAAAACA chr1    171539  NA      GATTAAAACA      1       GATTAAAACA      1       1
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       2
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       3
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       4
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       5
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       6
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180733  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       7
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180855  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       8
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT        chr1    180855  NA      GGTTAGGGTT      1       GGTTAGGGTT      1       9

Is there anything wrong?

Thanks, Yichao

YichaoOU commented 3 years ago

To add to previous post, I also found the Number of reads out is higher than Input Reads

2021-11-01 16:44:50,782 INFO Reads: Input Reads: 152791, Read pairs: 152791, Read 2 unmapped: 57778
2021-11-01 16:44:50,783 INFO Number of reads out: 261131, Number of groups: 91525
IanSudbery commented 3 years ago

Difficult to tell without being able to see the original alignments. What is the result of

samtools view RNA.st.bam | grep M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT
YichaoOU commented 3 years ago

It's a multi-mapping read, but not removed by the UMI algorithm?

M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    99  chr1    180747  0   13S37M  =   180749  42  GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:1  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180747  0   13S37M  =   180755  48  GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:3  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180747  0   13S37M  =   180761  54  GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:4  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180747  0   13S37M  =   180805  98  GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:5  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180747  0   13S37M  =   180871  164 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:7  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180747  0   13S37M  =   180959  252 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:8  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    147 chr1    180749  0   40M =   180747  -42 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:1  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180755  0   40M =   180747  -48 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:3  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180761  0   40M =   180747  -54 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:4  AS:i:76 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180805  0   40M =   180747  -98 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:5  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180867  0   11S39M  =   180871  44  GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:2  AS:i:76 nM:i:1  NM:i:1  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    355 chr1    180867  0   11S39M  =   180959  132 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA  ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH  NH:i:8  HI:i:6  AS:i:75 nM:i:1  NM:i:1  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180871  0   40M =   180867  -44 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:2  AS:i:76 nM:i:1  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180871  0   40M =   180747  -164    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:7  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180959  0   40M =   180867  -132    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:6  AS:i:75 nM:i:1  NM:i:0  XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT    403 chr1    180959  0   40M =   180747  -252    CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT    F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA    NH:i:8  HI:i:8  AS:i:75 nM:i:0  NM:i:0  XS:Z:Unassigned_MultiMapping

I also found with or without --per-cell won't affect my result at all. I'm wondering why? My read header is read_name_[cell_barcode]_[UMI]. --per-cell should be able to recognize the cell barcode in the middle, right?

Thanks, Yichao

IanSudbery commented 3 years ago

No, different alignments of the same read are treated as independent reads. This is a deliberate design choice, as different applications require different approaches to multimapping reads, so we leave it to the user to process them as is approprate for their aims. The simplest thing to do is to use samtools to filter the input BAM so that it only contains primary alignments, or to filter out multimapping reads entirely on the NH tag.

For droplet-based single cell RNA-seq, alevin actaully uses the UMI information to try to work out what this the mostly likely "correct" alignment of the read.

--per-cell should definately be able to recognize the cell barcode in the middle. When you say --per-cell doesn't affect your results, do you mean that the output from the command is identical with or without --per-cell? Thats definately not how it should be.

YichaoOU commented 3 years ago

Thank you again for all the information!

The output read order is not the same, but once sorted , the md5sum is the same.


umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired --per-cell --random-seed 1 > test.bam
umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired  --random-seed 1 > test2.bam

samtools view test.bam | cut -f 1 | sort | md5sum
f8cd77d702ce64aa76d3a6f6955a37f0  -
samtools view test2.bam | cut -f 1 | sort | md5sum
f8cd77d702ce64aa76d3a6f6955a37f0  -

Another question for:

2021-11-01 16:44:50,782 INFO Reads: Input Reads: 152791, Read pairs: 152791, Read 2 unmapped: 57778
2021-11-01 16:44:50,783 INFO Number of reads out: 261131, Number of groups: 91525

How is the numbers of input reads are calculated? I found input reads did not match samtools flagstat but output reads did match.

samtools flagstat test.bam
261131 + 0 in total (QC-passed reads + QC-failed reads)
130106 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
261131 + 0 mapped (100.00% : N/A)
131025 + 0 paired in sequencing
51408 + 0 read1
79617 + 0 read2
102816 + 0 properly paired (78.47% : N/A)
102816 + 0 with itself and mate mapped
28209 + 0 singletons (21.53% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat RNA.st.bam
318909 + 0 in total (QC-passed reads + QC-failed reads)
167148 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
318909 + 0 mapped (100.00% : N/A)
151761 + 0 paired in sequencing
72144 + 0 read1
79617 + 0 read2
102816 + 0 properly paired (67.75% : N/A)
102816 + 0 with itself and mate mapped
48945 + 0 singletons (32.25% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
IanSudbery commented 2 years ago

Hmmm.... Something is very odd about that RNA.st.bam file - the numbers for read1 + read2 + singletons does not add up to the total number of reads.