Closed YichaoOU closed 1 year ago
To add to previous post, I also found the Number of reads out
is higher than Input Reads
2021-11-01 16:44:50,782 INFO Reads: Input Reads: 152791, Read pairs: 152791, Read 2 unmapped: 57778
2021-11-01 16:44:50,783 INFO Number of reads out: 261131, Number of groups: 91525
Difficult to tell without being able to see the original alignments. What is the result of
samtools view RNA.st.bam | grep M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT
It's a multi-mapping read, but not removed by the UMI algorithm?
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 99 chr1 180747 0 13S37M = 180749 42 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:1 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180747 0 13S37M = 180755 48 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:3 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180747 0 13S37M = 180761 54 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:4 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180747 0 13S37M = 180805 98 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:5 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180747 0 13S37M = 180871 164 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:7 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180747 0 13S37M = 180959 252 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:8 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 147 chr1 180749 0 40M = 180747 -42 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:1 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180755 0 40M = 180747 -48 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:3 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180761 0 40M = 180747 -54 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:4 AS:i:76 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180805 0 40M = 180747 -98 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:5 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180867 0 11S39M = 180871 44 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:2 AS:i:76 nM:i:1 NM:i:1 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 355 chr1 180867 0 11S39M = 180959 132 GTGTATAAGAGACAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ACBCCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHH NH:i:8 HI:i:6 AS:i:75 nM:i:1 NM:i:1 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180871 0 40M = 180867 -44 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:2 AS:i:76 nM:i:1 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180871 0 40M = 180747 -164 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:7 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180959 0 40M = 180867 -132 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:6 AS:i:75 nM:i:1 NM:i:0 XS:Z:Unassigned_MultiMapping
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT 403 chr1 180959 0 40M = 180747 -252 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT F00AB0EF0AB0FFGCGEHHHGGEHHHHGGGFBA1BBBBA NH:i:8 HI:i:8 AS:i:75 nM:i:0 NM:i:0 XS:Z:Unassigned_MultiMapping
I also found with or without --per-cell
won't affect my result at all. I'm wondering why? My read header is read_name_[cell_barcode]_[UMI]
. --per-cell
should be able to recognize the cell barcode in the middle, right?
Thanks, Yichao
No, different alignments of the same read are treated as independent reads. This is a deliberate design choice, as different applications require different approaches to multimapping reads, so we leave it to the user to process them as is approprate for their aims. The simplest thing to do is to use samtools to filter the input BAM so that it only contains primary alignments, or to filter out multimapping reads entirely on the NH tag.
For droplet-based single cell RNA-seq, alevin
actaully uses the UMI information to try to work out what this the mostly likely "correct" alignment of the read.
--per-cell
should definately be able to recognize the cell barcode in the middle. When you say --per-cell
doesn't affect your results, do you mean that the output from the command is identical with or without --per-cell
? Thats definately not how it should be.
Thank you again for all the information!
The output read order is not the same, but once sorted , the md5sum is the same.
umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired --per-cell --random-seed 1 > test.bam
umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired --random-seed 1 > test2.bam
samtools view test.bam | cut -f 1 | sort | md5sum
f8cd77d702ce64aa76d3a6f6955a37f0 -
samtools view test2.bam | cut -f 1 | sort | md5sum
f8cd77d702ce64aa76d3a6f6955a37f0 -
Another question for:
2021-11-01 16:44:50,782 INFO Reads: Input Reads: 152791, Read pairs: 152791, Read 2 unmapped: 57778
2021-11-01 16:44:50,783 INFO Number of reads out: 261131, Number of groups: 91525
How is the numbers of input reads
are calculated? I found input reads
did not match samtools flagstat
but output reads
did match.
samtools flagstat test.bam
261131 + 0 in total (QC-passed reads + QC-failed reads)
130106 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
261131 + 0 mapped (100.00% : N/A)
131025 + 0 paired in sequencing
51408 + 0 read1
79617 + 0 read2
102816 + 0 properly paired (78.47% : N/A)
102816 + 0 with itself and mate mapped
28209 + 0 singletons (21.53% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat RNA.st.bam
318909 + 0 in total (QC-passed reads + QC-failed reads)
167148 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
318909 + 0 mapped (100.00% : N/A)
151761 + 0 paired in sequencing
72144 + 0 read1
79617 + 0 read2
102816 + 0 properly paired (67.75% : N/A)
102816 + 0 with itself and mate mapped
48945 + 0 singletons (32.25% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Hmmm.... Something is very odd about that RNA.st.bam
file - the numbers for read1 + read2 + singletons does not add up to the total number of reads.
This read
M04990:85:000000000-G8PPT:1:1101:25236:14430_TACAGGATTAGCTTGTTAGAACAC_GGTTAGGGTT
is being assigned to multiple UMI groups?command I used
umi_tools group -I RNA.st.bam --group-out=grouped.tsv --output-bam --log=group.log --paired --per-cell > test.bam
Is there anything wrong?
Thanks, Yichao