HorvathLab / NGS

Next-Gen Sequencing tools from the Horvath Lab
https://horvathlab.github.io/NGS/
MIT License
39 stars 16 forks source link

Some confusions about scReadCount #5

Closed XUEbaogai0101 closed 2 years ago

XUEbaogai0101 commented 2 years ago

Hello, Thank you very much for generating such convenient tools. I have met some confusions when I used scReadCounts. This is my code.

bin/scReadCounts -r sorted.bam -s IDH1.txt -o HGY1.tsv -m 5 -t 8 -b barcodes.tsv -G STARsolo

It worked successfully in a few seconds because I just wrote one SNV.

2   208248389   G   C

But when I searched the barcodes which are in the HGY1.tsv from the bam files, and match the locations, I found some conflicts. Such as the following code.

samtools view sorted.bam 2:208248362-208248589|grep 'AGCCATGCAAGAGATCTGGTGGTA'|less -S

It output just one line.

AGCCATGCAAGAGATCTGGTGGTA_ATACCAGCAATT_80081753  16      2       208243592       255     13M1714N106M2944N31M    *       0       0       ACACCACCACCTTCTTCAAAGTTATGTACCAGGTA.............

And in the output.tsv, 'AGCCATGCAAGAGATCTGGTGGTA' had one count.

CHRO|POS|REF|ALT|ReadGroup|SNVCountForward|SNVCountReverse|RefCountForward|RefCountReverse|SNVCount|RefCount|GoodReads|%BadRead|VAF
2 | 208248389 | G | C | AGCCATGCAAGAGATCTGGTGGTA | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0

Obviously, this sequence couldn't cover the location of 208248389, so how could this sequence be recognized as an SNV? Then I try another code to match the location.

samtools view sorted.bam 2:208248389|grep 'AGCCATGCAAGAGATCTGGTGGTA'|less -S

It output many lines.I pick the top 3 lines.

AGCCATGCAAGAGATCTGGTGGTA_ATACCAGCAATT_80081753  16      2       208243592       255     13M1714N106M2944N31M    *       0       0       ACACCACCACCTTCTTCAAAGTTATGTACCAGGTA..........
AGCCATGCAAGAGATCTGGTGGTA_AAGCGAACCTGC_13200871  16      2       208248639       255     22M2769N128M    *       0       0       ATTCTCTATGCCTAAATCATAGCTATGTAGATCCAATTCCACG.........
AGCCATGCAAGAGATCTGGTGGTA_AAGCGAACCTGC_71964891  16      2       208248639       255     22M2769N128M    *       0       0       ATTCTCTATGCCTAAATCATAGCTATGTAGATCCAATTCCACG........

These sequences also skipped the location of 208248389. Did I do any wrong with the samtools?Or was I misunterstanding about the sequence lenth? Thank you very much! Best XUE

edwardsnj commented 2 years ago

Here is my best guess, based on the information you have provided. Each of the alignments you've provided has a large "gap" in the alignment (as is reasonable for RNA-Seq). You can see this in the cigar string: 13M1714N106M2944N31M for the first alignment, where there are 13 matched bases, a gap of 1714 bases, 106 matching bases, a gap of 2944 bases, and 31 matching bases. scReadCounts (and the code underneath, which counts the reads) is using samtools' pileup API, which takes all of this into account and indicates the the reads that overlap each loci.

Samtools fetch API (which is what you are using for the view command) will return any read whose alignment footprint (including gaps) includes the bases provided, so those additional reads must have gaps in their alignments at that spot.

Lastly, scReadCounts is NOT recognizing SNVs (per se), it is merely extracting the counts so you can decide whether the variant is present or not.

Hope this helps.

Let me know if this explains your issues or if I haven't guessed correctly.

Cheers!

XUEbaogai0101 commented 2 years ago

Thanks for your generous answer! @edwardsnj It really deepens my understanding of samtools and scReadCounts. It is very useful! Thank you again! Cheers!