Closed yinzhuo0322 closed 9 months ago
Hi Yingchou Hu,
The reads to show are all read2. Deduplication is based on the alignment positions of read1 in each pair, plus the insert size (the TLEN attribute). Thus the position column 3 of the group output will be the alignment position of read1 in each pair. Sorry for the confusion.
Ian
Hi Ian,
Thanks for your quick and nice reply!
However, rows 5 to 8 all correspond to "read1" because they have the flag 83 (indicating "containing first in pair (0x40)").
M00672:435:000000000-CR3YJ:1:2103:5315:16613_AAAGTTAT 83 chr1 114746257 255 56M = 114746158 -155 CGACCAATATAAAGTACTGAGTGATTACACTCCTTGATTCTGTAGACTGAAGTTTT :*5**;9;*00D;5**)8**8;*9;;;885==8:0*;00)*9?B=+ACBCF=;+CF NH:i:1 HI:i:1 AS:i:114 NM:i:MD:Z:56 XS:Z:Assigned XN:i:1 XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1107:4120:7087_AAAGTTAT 83 chr1 114746273 255 13S40M = 114746026 -287 CCAATATAAAGGCCTGAGTGACTAAACTCCTTGATTCTGTAGACTGACGTTTT 5>;9;0**80))))50**7D;*;*))1)0*D<EEDE=:9*E9;FFD;*87D?D NH:i:1 HI:i:1 AS:i:90 NM:i:3 MD:Z:8T2C22A5 XS:Z:Assigned XN:i:1 XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1109:17856:20083_AAAGTTAT 83 chr1 114746289 255 24M = 114746138 -175 CTTGATTCTGTAGACTGAGGTTTT >5FEDBFFFEBFEFFFFFFFGEFG NH:i:1 HI:i:1 AS:i:79 NM:i:1 MD:Z:18A5 XS:Z:Assigned XN:i:1 XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:2107:14779:22843_AAAGTTAT 83 chr1 114746299 255 14M = 114745953 -360 TAGACTGAAGTTTED93*BFFFDFFGD NH:i:1 HI:i:1 AS:i:71 NM:i:0 MD:Z:14 XS:Z:Assigned XN:i:1 XT:Z:ENSG00000009307.17
Nevertheless, it appears that they do not share the same alignment start, and their values are not consistent with the value 114746312 in column 3.
Sorry to bother you again, but I'm still eager to understand where I might have made a mistake.
And I also want to know that if two read pairs map to the same gene, and have the same UMI and the same alignment position of read1, but the insert size is different, would they be considered as PCR duplication? If not, how can I just take the UMI and gene into account? Because my protocol is only fragmenting at 3' end after PCR amplification.
Thanks again for your attention and patience! Looking forward to your reply!
Best regards,
Yingchuo Hu
First, let me appologise for the delay in getting back to you. I am very busy with teaching at this time of year.
I missed originally that you are running with --per-gene
. When we run with --per-gene
, the pos
column is not used in deciding whether or not a read is a duplicate or not, only which gene a read aligns to (actaully, only which gene read1 aligns to - you can set featreCounts to exclude pairs which map to more than one gene).
As for the group output.... The pos column in the group output is calcualted as the genome position that the most 5' base of the read aligns to, not the most 5' base in the genome that is aligned to the read. These differ most when the read is aligned to the reverse strand. The difference can be confusing but consider.
Read: 3' <<<<<<<<<< 5'
Genome: 5' ------------------------------ 3'
Bamfile Pos: ^(5' most base of the genome the read aligns to)
umitools pos: ^ (genome base that the 5' most base of the read aligns to)
The 5' most base for the reads (which are all on the reverse strand) you show are: first read: 114746257 + 56 = 114746313 second read: 116746273 + 40 =114746313 third read: 114746289 + 24 = 114746313 fourth read: 114746299 + 14 = 114746313
Because of the zero-based, half-open nature of genome coordinates, we need to substract one from these to find the base the 5' most base of the read aligns to.
But, as I said, pos isn't used to determine duplicates if we run with --per-gene
, only the identity of the gene aligned to.
Hi Ian,
I am glad to receive your reply again, it's ok, you must be busy with lots of things and you deserve to take a well-earned break!
Thanks a lot for such a patient and clear explanation for me! Now, I have fully understood that if we run with --per-gene
, only the reads which have the same UMI and mapped to the same gene will be considered as the "duplication"; and the pos column in the group output is not the reads start in the bam file! Without your help, maybe I never notice the details of bam file, it is so tricking! And I am so inspired by you and will try more effort in understanding the logic and detail in data analysis.
Sincerely,
Yingchuo Hu
hello,
A wonderful toolkit, and I appreciate it very much!
I have looked through the documents, and found that column 3 "position " means Alignment position. Note that this position is not the start position of the read in the BAM file but the start of the read taking into account the read strand and cigar. But I still have a question.
I ran this command
to group reads based on their UMI and mapping coordinates and gene. (If reads have the same UMI and mapped the same gene, then they are the PCR duplication? ( fragmentation after PCR amplification) Regardless of the mapping coordinates?)
And here are some results,
As you can see, the first reads have the same UMI and mapped the same gene, so they are actually the same molecule, so the real count is one when I ran the umi-tools count function and the result is as below,
But, when I checked the input bam, I found the four reads don't have the same start.
So, I am very confused with the column 3 "position " of the group output. The four reads have the same column 3, which means they have the same alignment start, but in the bam, the alignment start of the four reads is different. I wonder if I misunderstood something. I am not sure whether I have expressed my thoughts clearly. Please let me know if you require any additional information to assist with your response.
Thanks for your attention and patience! Looking forward to your reply!
Best regards,
Yingchuo Hu