alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Junction counts in STARsolo matrix.mtx do not match actual counts #2096

Open NaotoKubota opened 8 months ago

NaotoKubota commented 8 months ago

Hi Alex,

I am trying to analyze our customized 10x snRNA-seq data to see alternative splicing, but I found junction read counts stored in matrix.mtx do not match my observation of bam files in IGV.

The execution command is like below:

STAR \
--runThreadN 32 \
--soloType CB_UMI_Simple \
--genomeDir hogehoge \
--readFilesIn \
fugafuga_R2 \
fugafuga_R1 \
--readFilesCommand zcat \
--soloCBwhitelist whitelist \
--soloUMIlen 12 \
--soloBarcodeReadLength 150 \
--soloCellFilter EmptyDrops_CR \
--soloFeatures Gene GeneFull SJ \
--clipAdapterType CellRanger4 \
--outFilterScoreMin 30 \
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
--soloUMIfiltering MultiGeneUMI_CR \
--soloUMIdedup 1MM_CR \
--outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM hogehoge

For example, the junction chr10 127296634 127296872 (An exclusive junction of the skipped exon shown in the IGV image) seems to have more than 1,000 reads based on the IGV sashimi plot but actually the count in matrix.mtx is 1. The bam file shown in IGV is STARsolo output bam after UMI deduplication by UMItools. image I found many regions showing such junction count inconsistency.

I found a similar issue #2049 and understand the counts in the matrix.mtx are after UMI deduplication, but it is still unclear how the junction count matrix is created and why my data show this inconsistency.

Do you have any ideas on possible causes?

Best,

alexdobin commented 7 months ago

Hi Naoto,

matrix.mtx counts the number of UMIs uniquely mapping to this junction. So another reason for discrepancy is that there are a lot of multimapping reads that you see in the browser which are not counted in matrix.mtx.

NaotoKubota commented 7 months ago

Hi Alex,

I checked the CIGAR of reads mapped to the region and it seems all reads were uniquely mapped (the NH tags were 1). Still unclear why those reads are not counted...

fursham-h commented 6 months ago

I am also experiencing the same inconsistencies. I am testing STARsolo on a simulated SmartSeq RNA-seq format with no read deduplication, which should not remove exact reads. The total number of uniquely mapped SJ reads in the SJ.out.tab file does not tally with the total SJ reads in the Solo.out/SJ/raw/matrix.mtx file. The number of reads in the mtx file is typically 2/3 lesser than the ones in the SJ.out.tab file.

It would be nice to address this issue as STARsolo has been an amazing tool to use with single cell rnaseq datasets.