alexdobin / STAR

RNA-seq aligner
MIT License
1.84k stars 505 forks source link

STARsolo counting on overlapping features #847

Closed crazyhottommy closed 4 years ago

crazyhottommy commented 4 years ago

Hi Alex, I want to test STARsolo for my 10x runs. I have a question on counting the reads on overlapping features. How STARsolo handle this?

I heard that cellranger discard reads that fall into the overlapping features. This is problematic.

Someone on twitter says "With scRNA nuclei data we've noticed that cellranger's simplified way of dealing with introns (filling between exons) yields many overlapping genes, which it handles very poorly in the counting step: discarding ALL reads for both genes. Kallisto handles this better"

https://twitter.com/KLichtenberg2/status/1231856314852225024

For example, The Kiss1 gene in the mm10 genome is overlapping with Golt1a and Gm28040 gene. We got no cells has Kiss1 and Golt1a counts but some have Gm28040 counts.

Screen Shot 2020-02-25 at 10 37 19 AM

I want to know how STARsolo deals with this situation. Thanks!

alexdobin commented 4 years ago

Hi Ming,

STARsolo follows CellRanger's logic and does not count reads that overlap multiple genes. This is a conservative approximation since it's impossible to confidently assign such multigene reads to one specific gene. I do not know how kb-python counts intronic reads in the single-nuclei data. The original bioRxiv paper does not describe it - actually the default option in kallisto/bustools is not to count the multigenic reads. Including multimapping reads requires a maximum likelihood estimation, and it's also an approximation which may work well in some cases and not so well in others.

Looking at your example, it seems that Kiss1 and Golt1a do not have a lot of overlapping exons, and Kiss1 and Gm28040 share exactly the same CDS. Also, the details page calls this gene KISS1_MOUSE, so I think this is some sort of misannotation. So, in this case, the best approach would be to simply rename Gm28040 tas Kiss1. :) Also, with the 3' ends so similar between the two genes, MLE won't be able to confidently deconvolve their expression.

Cheers Alex

crazyhottommy commented 4 years ago

Thanks Alex! this is good information. "actually the default option in kallisto/bustools is not to count the multigenic reads." This behavior is the same as cellranger and STARslolo? I will take a closer look then.

For single-nuclei data, discarding the reads in overlapping features of different genes can be more problematic. cellranger counts the reads in the exons + introns. If discarding the reads in overlapping features, the gene with exon overlapping a different gene's intron will have counts greatly reduced.

For STARsolo, the GeneFull mode is for single-nuclei right? and it is the same as what cellranger does? Thanks!

For the Kiss1 gene, I agree it is very difficult to assign reads even with MLE. If I import the gtf from 10x reference to IGV, it is even more complicated...

Screen Shot 2020-02-26 at 11 58 44 PM
alexdobin commented 4 years ago

Hi Ming

"actually the default option in kallisto/bustools is not to count the multigenic reads." This behavior is the same as cellranger and STARslolo? I will take a closer look then.

Just to be clear, this is true about the kallisto|bustools pipeline as described in the biorxiv paper. I am not sure how the kb-python pipelines works, and especially for the nuclei data (including introinic alignments).

For single-nuclei data, discarding the reads in overlapping features of different genes can be more problematic. cellranger counts the reads in the exons + introns. If discarding the reads in overlapping features, the gene with exon overlapping a different gene's intron will have counts greatly reduced.

That is definitely true, however, there is no simple way to assign such reads to the right gene. The 10X priming is very unspecific with a lot of occurring not at the expected polyA tail, but rather at any locus with a decent enrichment of As (as noted in the RNA-velocity paper). Simple ML models cannot really deal with it. That's why I think that the conservative approach - of not assigning such reads at all is - in general - a safer way to go.

For STARsolo, the GeneFull mode is for single-nuclei right? and it is the same as what cellranger does? Thanks!

Correct!

For the Kiss1 gene, I agree it is very difficult to assign reads even with MLE. If I import the gtf from 10x reference to IGV, it is even more complicated...

Right... I am thinking of implementing some sort of simple "aggregation" of multiple genes with very similar TTS into one supergene, and then counting the reads for supergenes. I think such an approach would resolve most of the cases that are similar to your example.

Cheers Alex

crazyhottommy commented 4 years ago

Thanks Alex! this is really useful information. I like the supergene idea. collapse the genes with very similar TES (typo?) into one gene and call it with gene1;gene2 so it is not lost in the counting matrix.

FYI, there is a superTranscript paper :) https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1284-1

alexdobin commented 4 years ago

TTS=Transcript Termination Site?

Not sure if supergene is a good name... SuperTranscript is a great concept, we are actually using it as a basis for the splice-graph in a different project (long-read mapping).

crazyhottommy commented 4 years ago

oh, TES is the transcription end site :) the same as TTS. thanks Alex for answering my questions. closing it now.