alexdobin / STAR

RNA-seq aligner
MIT License
1.79k stars 499 forks source link

BAM file generated by --quantMode TranscriptomeSAM not compatible with RSEM #1880

Open edg1983 opened 1 year ago

edg1983 commented 1 year ago

Hello,

I'm trying to use STARsolo to count genes and SJ from 10X 3' scRNA data and then I want to use RSEM for isoform quantification. Currently, I'm using STAR v2.7.10b.

First, I generated a reference dataset for STAR based on the 10X official reference genome.fa and genes.gtf files. Then I performed the alignment using STAR v2.7.10b with this command

STAR --runThreadN 12 \
    --soloType CB_UMI_Simple \
    --soloUMIlen 12 \
    --soloCellFilter EmptyDrops_CR \
    --soloFeatures GeneFull SJ \
    --genomeDir aligner_indexes/STAR/10X_2020-A_STAR-2.7.10b \
    --soloCBwhitelist 0x/bc_whitelist/3M-february-2018.txt \
    --outFileNamePrefix 10k_PBMC_3p. \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode TranscriptomeSAM \
        --clipAdapterType CellRanger4 \
        --soloMultiMappers EM \
    --readFilesIn <reads_files>

Then I generated the reference files for RSEM using the same genome.fa and genes.gtf from 10X used above.

Finally, I tried to quantify expression using

rsem-calculate-expression -p 8 --alignments 10k_PBMC_3p.Aligned.toTranscriptome.out.bam /reference_data/RSEM/10x_2020-A/GRCh38 10k_PBMC_3p

But I got this error message:

Read A00984:478:HFHCCDSX2:1:1101:5556:1000: RSEM currently does not support gapped alignments, sorry!

When I look at this read I see

A00984:478:HFHCCDSX2:1:1101:5556:1000   16  ENST00000604957 370 255 60M30S  *   0   0   ATCAGAGGTGAACACGGGATACGCGACCTGGTCATGGGAGGCGGGCGGTGAGAAGGAGAGCCCATGTACTCTGCGTTGATACCACTGCTT  FFFFFFFFFF:FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFF:FFFF,F:  NH:i:1  HI:i:1

If I interpreted this correctly, it means that this read has 30 soft-clipped bases.

However, according to STAR manual, the --quantMode TranscriptomeSAM option should produce a BAM file compatible with RSEM and without indels or soft-clipping. Indeed, the manual reports: "By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed."

Am I doing something wrong here or the STAR file toTranscriptome.out.bam simply is not compatible with RSEM?

Thanks a lot!

alexdobin commented 1 year ago

Hi Edoardo,

--clipAdapterType CellRanger4 forces clipping, so you need to omit it.

edg1983 commented 1 year ago

Great thanks!

Is there a way to generate transcript counts per cell when running STARsolo?

I can now get SJ counts per cell or use RSEM to get bulk transcript counts from the .toTrascprtome.bam file, but I can't see how to get transcript counts per cell.

Thanks for your support!

alexdobin commented 1 year ago

Hi Edoardo,

transcript quantification is in the works, but I cannot provide ETA. It's not trivial since most reads overlap more than one transcript.

edg1983 commented 1 year ago

Hi!

Conceptually I thought that I can get reads from the .toTrascprtome.bam file, split them into cell groups based on the cell barcodes (after determining cell clusters from standard scRNA matrix analysis for example), and then run RSEM on the various split to obtain per group transcripts quantification.

However, I've noticed that the CB and UB tags containing the barcode information are added to the coordSorted BAM but not to the .toTrascprtome.bam even if I've requested them in the command line.

Is there a way to have these tags also in the .toTrascprtome.bam file?

Otherwise, my assumption is that read IDs are the same between the 2 files and that I can transfer these tags from one BAM file to the other, right?

alexdobin commented 1 year ago

Hi Edoardo,

The output of CB/UB in the toTranscriptome.bam is not implemented presently. Read IDs are the same between two files, so you can transfer the tags.