alexdobin / STAR

RNA-seq aligner
MIT License

Solo.out.1/SJ/raw/matrix.mtx has fewer features than features.tsv and SJ.out.tab #1818

Open fabotao opened 1 year ago

fabotao commented 1 year ago

We processed dataset GSE115469 with STAR using the command below. However, Solo.out.1/SJ/raw/matrix.mtx has fewer rows (features) than features.tsv, so we cannot annotate the SJ matrix, either by hand or with the MARVEL program.

STAR --runThreadN 16 \
  --genomeDir refdata-cellranger-GRCh38-3.0.0 \
  --soloType CB_UMI_Simple \
  --readFilesIn SRR9008752_possorted_genome_bam.bam \
  --readFilesCommand samtools view -F 0x100 \
  --readFilesType SAM SE \
  --soloInputSAMattrBarcodeSeq CR UR \
  --soloInputSAMattrBarcodeQual CY UY \
  --soloCBwhitelist 737K-august-2016.txt \
  --soloFeatures Gene SJ

The dimensions of Solo.out.1/SJ/raw/matrix.mtx are 159339 x 737280 (159339 features), whereas features.tsv is 188538 x 9 (188538 features).

Looking forward to your reply!

alexdobin commented 1 year ago

Hi @fabotao

The matrix.mtx is a sparse matrix and may not contain all the junctions listed in features.tsv. Specifically, it leaves out the junctions detected for reads without correct barcodes/UMIs, and the junctions detected only with multimapping reads.
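To see which junctions have no entries in the matrix, one option is to scan matrix.mtx and collect the row indices that actually occur. The snippet below is only a minimal sketch, not part of STAR: it assumes the usual Matrix Market coordinate layout (comment lines starting with `%`, one size line, then 1-based `row column value` entries) and that row i of the matrix corresponds to line i of features.tsv. The function name `rows_with_entries` and the toy data are made up for illustration.

```python
import io

def rows_with_entries(mtx):
    """Return (declared_rows, set of 1-based row indices with at least one entry)."""
    declared = None
    seen = set()
    for line in mtx:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and Matrix Market comment/header lines
        fields = line.split()
        if declared is None:
            declared = int(fields[0])  # size line: rows, columns, non-zero entries
            continue
        seen.add(int(fields[0]))  # first field of an entry is the row (feature) index
    return declared, seen

# Toy matrix (NOT real STAR output): 4 rows declared, entries only in rows 1 and 3.
toy_mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "4 3 3\n"
    "1 1 2\n"
    "3 2 1\n"
    "3 3 5\n"
)
declared, seen = rows_with_entries(toy_mtx)
missing = [i for i in range(1, declared + 1) if i not in seen]
print(declared, missing)
```

For a real run you would pass an open file handle for Solo.out.1/SJ/raw/matrix.mtx instead of the toy `StringIO`, and look up the missing indices among the lines of features.tsv.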

fabotao commented 1 year ago

Thanks for your reply. It seems that the sparse matrix loses some rows. Do you have any suggestions on how to read this sparse matrix in combination with features.tsv and barcodes.tsv? Thanks a lot!

alexdobin commented 1 year ago

You can use standard tools, but you may need to modify the features.tsv file for splice junctions: combine the first 3 columns (chr, start, end), separating them with underscores, to create unique splice junction IDs.
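The column merge described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `rewrite_features` is hypothetical, the input is assumed to be tab-separated with chr, start, end in the first three columns, and the two-column output (ID repeated as name) is a guess at what a 10x-style reader expects, so adjust it to your tool. The junction rows are toy data.

```python
import csv
import io

def rewrite_features(src, dst):
    """Read tab-separated junction rows from src and write 'id<TAB>id' rows to dst.

    The ID is the first three columns (chr, start, end) joined by underscores.
    Returns the list of IDs in file order.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    ids = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        jid = "_".join(row[:3])
        ids.append(jid)
        writer.writerow([jid, jid])  # assumed layout: ID column, name column
    return ids

# Toy junction rows (NOT real STAR output), 9 tab-separated columns each.
toy_features = io.StringIO(
    "chr1\t14830\t14969\t2\t2\t1\t10\t3\t38\n"
    "chr1\t15039\t15795\t2\t2\t1\t7\t0\t40\n"
)
out = io.StringIO()
ids = rewrite_features(toy_features, out)
print(ids)
print(out.getvalue(), end="")
```

With real data, `src` and `dst` would be file handles for the original features.tsv and a new file. Writing to a new file rather than overwriting keeps the original junction annotation columns available.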