CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

Confused with the output "position" of group function #603

Closed yinzhuo0322 closed 9 months ago

yinzhuo0322 commented 1 year ago

hello,

A wonderful toolkit, and I appreciate it very much!

I have looked through the documents, and found that column 3 "position " means Alignment position. Note that this position is not the start position of the read in the BAM file but the start of the read taking into account the read strand and cigar. But I still have a question.

I ran this command

$umi_tools group --per-gene --gene-tag=XT -I $outdir/STAR_align.PE.filter_mct.umi/${sampleid}.${species}.rna_reads.featureCounts.sorted.bam --paired --group-out=$outdir/STAR_align.PE.filter_mct.umi/${sampleid}.${species}.rna_reads.sorted.groups.tsv --output-bam -S $outdir/STAR_align.PE.filter_mct.umi/${sampleid}.${species}.rna_reads.grouped.bam

to group reads based on their UMI and mapping coordinates and gene. (If reads have the same UMI and mapped the same gene, then they are the PCR duplication? ( fragmentation after PCR amplification) Regardless of the mapping coordinates?)

And here are some results,

(base) [node07: /gshare/xielab/huyc/workflow/TSO+RNA/Data_run/ren_20230904_miseq_TSO+RNA_K562_both_convert/ ]
$ less Output_s1.PE.filter_mct.stranded.umi/STAR_align.PE.filter_mct.umi/230829_TSOandRNA_K562_TSO_test_RNA_FT.GTCCTAAG.hg38.rna_reads.sorted.groups.tsv|grep "ENSG00000009307.17"
M00672:435:000000000-CR3YJ:1:2103:5315:16613_AAAGTTAT   chr1    114746312   ENSG00000009307.17  AAAGTTAT    4   AAAGTTAT    4   1
M00672:435:000000000-CR3YJ:1:1107:4120:7087_AAAGTTAT    chr1    114746312   ENSG00000009307.17  AAAGTTAT    4   AAAGTTAT    4   1
M00672:435:000000000-CR3YJ:1:1109:17856:20083_AAAGTTAT  chr1    114746312   ENSG00000009307.17  AAAGTTAT    4   AAAGTTAT    4   1
M00672:435:000000000-CR3YJ:1:2107:14779:22843_AAAGTTAT  chr1    114746312   ENSG00000009307.17  AAAGTTAT    4   AAAGTTAT    4   1

As you can see, the first reads have the same UMI and mapped the same gene, so they are actually the same molecule, so the real count is one when I ran the umi-tools count function and the result is as below,

(base) [node07: /gshare/xielab/huyc/workflow/TSO+RNA/Data_run/ren_20230904_miseq_TSO+RNA_K562_both_convert/ ]
$ less Output_s1.PE.filter_mct.stranded.umi/umi_count/230829_TSOandRNA_K562_TSO_test_RNA_FT.GTCCTAAG.hg38.rna_reads.sorted.counts.tsv.gz |grep ENSG00000009307.17
ENSG00000009307.17  1

But, when I checked the input bam, I found the four reads don't have the same start.

(base) [node07: /gshare/xielab/huyc/workflow/TSO+RNA/Data_run/ren_20230904_miseq_TSO+RNA_K562_both_convert/ ]
$ samtools view Output_s1.PE.filter_mct.stranded.umi/STAR_align.PE.filter_mct.umi/230829_TSOandRNA_K562_TSO_test_RNA_FT.GTCCTAAG.hg38.rna_reads.featureCounts.dedup.bam|less|grep "XS:Z:Assigned"|grep "ENSG00000009307.17"
M00672:435:000000000-CR3YJ:1:2107:14779:22843_AAAGTTAT  163  chr1  114745953  255  59M  =  114746299  360  GTTCCTTTGGGGAGAGGGGGTGGAAATCTATGCTTTGTAATTTATTACAACATCTAAGT  CCCCCGGGGGGG7FGGGGGGGGGGCFGFDFDEECFGCFFCGFFGFF?FD7:;*@0C7;D  NH:i:1  HI:i:1  AS:i:71  NM:i:0  MD:Z:59  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1107:4120:7087_AAAGTTAT  163  chr1  114746026  255  60M  =  114746273  287  AATCTGGTCTGCTTGGCTTCTACTGATTGACAATCAATTTGGAAAATGTAGCCTATTTCA  CCCCCGDGFGGGGGGEFFGGGAGGGFEE;DDFD?G,?EEGG+0@?5+0*;;D8*:*CCE7  NH:i:1  HI:i:1  AS:i:90  NM:i:1  MD:Z:46G13  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1109:17856:20083_AAAGTTAT  163  chr1  114746138  255  59M  =  114746289  175  CTTCCACTGTTTCTTCCAATTTTCTGAACACAACCATTTTTAACACTAATTGCTGTCTT  CCCCCGGGGGGGGGGGGGFGGGGGGGGFGFGGGGG=DGGGGGFF+;;*8;;=DE89=C9  NH:i:1  HI:i:1  AS:i:79  NM:i:0  MD:Z:59  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:2103:5315:16613_AAAGTTAT  163  chr1  114746158  255  60M  =  114746257  155  TTTCTGAACACAACCATTTTTAACACTAATTGCTGTCTTCTGGATAGGTTCTAAGTAATT  CCCCCGGFFGGGGGGGGGGGGGGGGGC?EGG8EFFGFG88=FFF+,@?D9=DBD7C>DDF  NH:i:1  HI:i:1  AS:i:114  NM:i:MD:Z:60  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:2103:5315:16613_AAAGTTAT  83  chr1  114746257  255  56M  =  114746158  -155  CGACCAATATAAAGTACTGAGTGATTACACTCCTTGATTCTGTAGACTGAAGTTTT  :*5**;9;*00D;5**)8**8;*9;;;885==8:0*;00)*9?B=+ACBCF=;+CF  NH:i:1  HI:i:1  AS:i:114  NM:i:MD:Z:56  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1107:4120:7087_AAAGTTAT  83  chr1  114746273  255  13S40M  =  114746026  -287  CCAATATAAAGGCCTGAGTGACTAAACTCCTTGATTCTGTAGACTGACGTTTT  5>;9;0**80))))50**7D;*;*))1)0*D<EEDE=:9*E9;FFD;*87D?D  NH:i:1  HI:i:1  AS:i:90  NM:i:3  MD:Z:8T2C22A5  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1109:17856:20083_AAAGTTAT  83  chr1  114746289  255  24M  =  114746138  -175  CTTGATTCTGTAGACTGAGGTTTT  >5FEDBFFFEBFEFFFFFFFGEFG  NH:i:1  HI:i:1  AS:i:79  NM:i:1  MD:Z:18A5  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:2107:14779:22843_AAAGTTAT  83  chr1  114746299  255  14M  =  114745953  -360  TAGACTGAAGTTTED93*BFFFDFFGD  NH:i:1  HI:i:1  AS:i:71  NM:i:0  MD:Z:14  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17

So, I am very confused with the column 3 "position " of the group output. The four reads have the same column 3, which means they have the same alignment start, but in the bam, the alignment start of the four reads is different. I wonder if I misunderstood something. I am not sure whether I have expressed my thoughts clearly. Please let me know if you require any additional information to assist with your response.

Thanks for your attention and patience! Looking forward to your reply!

Best regards,

Yingchuo Hu

IanSudbery commented 1 year ago

Hi Yingchou Hu,

The reads to show are all read2. Deduplication is based on the alignment positions of read1 in each pair, plus the insert size (the TLEN attribute). Thus the position column 3 of the group output will be the alignment position of read1 in each pair. Sorry for the confusion.

Ian

yinzhuo0322 commented 1 year ago

Hi Ian,

Thanks for your quick and nice reply!

However, rows 5 to 8 all correspond to "read1" because they have the flag 83 (indicating "containing first in pair (0x40)").

M00672:435:000000000-CR3YJ:1:2103:5315:16613_AAAGTTAT  83  chr1  114746257  255  56M  =  114746158  -155  CGACCAATATAAAGTACTGAGTGATTACACTCCTTGATTCTGTAGACTGAAGTTTT  :*5**;9;*00D;5**)8**8;*9;;;885==8:0*;00)*9?B=+ACBCF=;+CF  NH:i:1  HI:i:1  AS:i:114  NM:i:MD:Z:56  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1107:4120:7087_AAAGTTAT  83  chr1  114746273  255  13S40M  =  114746026  -287  CCAATATAAAGGCCTGAGTGACTAAACTCCTTGATTCTGTAGACTGACGTTTT  5>;9;0**80))))50**7D;*;*))1)0*D<EEDE=:9*E9;FFD;*87D?D  NH:i:1  HI:i:1  AS:i:90  NM:i:3  MD:Z:8T2C22A5  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:1109:17856:20083_AAAGTTAT  83  chr1  114746289  255  24M  =  114746138  -175  CTTGATTCTGTAGACTGAGGTTTT  >5FEDBFFFEBFEFFFFFFFGEFG  NH:i:1  HI:i:1  AS:i:79  NM:i:1  MD:Z:18A5  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17
M00672:435:000000000-CR3YJ:1:2107:14779:22843_AAAGTTAT  83  chr1  114746299  255  14M  =  114745953  -360  TAGACTGAAGTTTED93*BFFFDFFGD  NH:i:1  HI:i:1  AS:i:71  NM:i:0  MD:Z:14  XS:Z:Assigned  XN:i:1  XT:Z:ENSG00000009307.17 

Nevertheless, it appears that they do not share the same alignment start, and their values are not consistent with the value 114746312 in column 3.

Sorry to bother you again, but I'm still eager to understand where I might have made a mistake.

And I also want to know that if two read pairs map to the same gene, and have the same UMI and the same alignment position of read1, but the insert size is different, would they be considered as PCR duplication? If not, how can I just take the UMI and gene into account? Because my protocol is only fragmenting at 3' end after PCR amplification.

Thanks again for your attention and patience! Looking forward to your reply!

Best regards,

Yingchuo Hu

IanSudbery commented 1 year ago

First, let me appologise for the delay in getting back to you. I am very busy with teaching at this time of year.

I missed originally that you are running with --per-gene. When we run with --per-gene, the pos column is not used in deciding whether or not a read is a duplicate or not, only which gene a read aligns to (actaully, only which gene read1 aligns to - you can set featreCounts to exclude pairs which map to more than one gene).

As for the group output.... The pos column in the group output is calcualted as the genome position that the most 5' base of the read aligns to, not the most 5' base in the genome that is aligned to the read. These differ most when the read is aligned to the reverse strand. The difference can be confusing but consider.

Read:                    3' <<<<<<<<<< 5'
Genome:       5' ------------------------------ 3'
Bamfile Pos:                ^(5' most base of the genome the read aligns to)
umitools pos:                        ^ (genome base that the 5' most base of the read aligns to)

The 5' most base for the reads (which are all on the reverse strand) you show are: first read: 114746257 + 56 = 114746313 second read: 116746273 + 40 =114746313 third read: 114746289 + 24 = 114746313 fourth read: 114746299 + 14 = 114746313

Because of the zero-based, half-open nature of genome coordinates, we need to substract one from these to find the base the 5' most base of the read aligns to.

But, as I said, pos isn't used to determine duplicates if we run with --per-gene, only the identity of the gene aligned to.

yinzhuo0322 commented 1 year ago

Hi Ian,

I am glad to receive your reply again, it's ok, you must be busy with lots of things and you deserve to take a well-earned break!

Thanks a lot for such a patient and clear explanation for me! Now, I have fully understood that if we run with --per-gene, only the reads which have the same UMI and mapped to the same gene will be considered as the "duplication"; and the pos column in the group output is not the reads start in the bam file! Without your help, maybe I never notice the details of bam file, it is so tricking! And I am so inspired by you and will try more effort in understanding the logic and detail in data analysis.

Sincerely,

Yingchuo Hu