alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

STARsolo parameters for CEL-seq2? #1966

Open wlkath opened 1 year ago

wlkath commented 1 year ago

Hi all,

I've been trying to realign some CEL-SeQ2 data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157504 The original alignment/counting was performed using zUMIs.

The UMI+BC is on read 1 (UMI 1-6, BC 7-12), e.g.,

@SRR12589500.16250 16250/1 CTTGTTCTGTTG + AAAAAEEEEEEE @SRR12589500.16251 16251/1 GCGTATTCACTC + AAAAAEEEEEEE

Following a previous post, I tried the options

--soloType CB_UMI_Simple --clip5pNbases 12 0 --soloUMIstart 1 --soloUMIlen 6 --soloCBstart 7 --soloCBlen 6 --soloCBwhitelist None --soloUMIdedup Exact

and this does produce cells/features, but only about 10% of the number expected from the count matrices in GSE157504.

After reading this post https://github.com/pachterlab/kallisto/issues/309 and trying this kallisto/bustools option I suspect that due to the short UMI sequences the only way to count this data properly is if UMIs are collapsed only when the associated reads map to the same gene.

Are there are STARsolo options that will skip the collapsing of the same UMIs if they fall on different features?

Thanks very much,

alexdobin commented 1 year ago

Hi Bill,

STARsolo collapses UMIs only if they have the same CB/UMI CB/Gene. It could be strand issue, please try --soloStrand Reverse.

wlkath commented 1 year ago

Hi Alex,

That indeed fixed the problem, and I feel foolish that I didn't try it myself. In my defense, this post https://github.com/alexdobin/STAR/issues/1385 suggested that --soloStrand Forward was the option to go with, and to top that off I observed that with kallisto/bustools --strand forward is the correct option to use.

But in addition clearly I don't fully understand how the UMI collapsing works. I had thought that only unique UMIs within each CB would be kept, but that can't be right --- with a 6 nucleotide UMI that only gives 4^6 = 4096 possibilities, and there are plenty of CBs with an order of magnitude more UMIs than this.

So, what's the extra piece that I'm missing? Does both the UMI and read have to match in order to be collapsed? Or is it that the UMI and the feature the read maps to has to match?

Again, thanks very much, both for the swiss army knife of aligners and for the willingness to give us pointers.

Best regards, Bill

alexdobin commented 1 year ago

Hi Bill,

sorry, I had it wrong in the post above (edited now): it should be STARsolo collapses UMIs only if they have the same CB/Gene.

bjstewart1 commented 11 months ago

@wlkath would you mind sharing the full command that works for you with celseq2 data? This would be extremely useful

wlkath commented 4 months ago

Sorry, I somehow missed the notification about your question. The data in question is from this paper: https://pubmed.ncbi.nlm.nih.gov/33438579/

I used:

STAR --runThreadN 20 --soloType CB_UMISimple --clip5pNbases 12 0 --soloBarcodeMate 1 --soloUMIstart 1 --soloUMIlen 6 --soloCBstart 7 --soloCBlen 6 --soloCBwhitelist None --soloStrand Reverse --soloUMIdedup Exact --soloFeatures Gene --genomeDir /home/kath/b1042/fb653ercc --outSAMtype None --quantMode GeneCounts --outFileNamePrefix $basename\ --readFilesCommand \"zcat\" --readFilesIn R1file R2file

I hope that helps, and again, sorry for not responding right away.