alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Help interpreting how STARSolo handles expression of a single read #2022

Open jamesnemesh opened 11 months ago

jamesnemesh commented 11 months ago

Hi! I'm trying to understand some (mostly minor) differences between expression as measured by the DropSeq toolkit (of which I'm the author) and STARSolo expression outputs as we evaluate the Optimus pipeline for the BICAN project. The vast majority of our results are highly correlated, but it's helpful to dive the few genes that differ. One of the exemplar differences is gene AC010327.4, which has a sum expression across all cells of 64 UMIs from DropSeq, and 0 from STARSolo.

To explore this, I grabbed a read that looks like it should generate a count for this gene. The read is on the same strand as the gene, and aligns to the intron of that gene. This read aligns the same way in both outputs (we use STAR to align, and the reference genome is refdata-cellranger-arc-GRCh38-2020-A), so it may be that the two programs interpret the gene annotations differently. This data is from a standard 10x experiment, and uses the flag --soloStrand Forward.

From DropSeq, the read has 3 tags: gn (gene name), gs (gene strand) and gf (gene function). These are comma separated lists and can be interpreted as a group. This read tags gene AC010327.4 as an intronic read on the same strand as the read, and SYT5 as an coding read on the opposite strand from the gene. DropSeq interprets this as an intronic UMI for gene AC010327.4.

HJVH2DMXX_MayNova1:1:2401:20989:9283 0 chr19 55171443 255 91M * 0 0 TTGGCTACCAGGAAGCTTAGGAGCAAGTAAGTTTGTTTGCTGCCTCCTTTTTTGGTACGTGGGGGTAATTTTTGCCTCATTCTCCACTTAA @C453685255687647:9678747777:777::67::65864478469999956696065666668:788884*266578666)5)7:BC XC:Z:GTGGGAACACATACTG MD:Z:91 XF:Z:INTRONIC PG:Z:STAR RG:Z:HJVH2.1.2 XG:Z:SYT5,AC010327.4 NH:i:1 NM:i:0 XM:Z:ATATGCAACCAC UQ:i:0 AS:i:89 gf:Z:INTRONIC,UTR gn:Z:AC010327.4,SYT5 gs:Z:+,-

These annotations look right when I look at the GTF directly, with the read in between these two exons of AC010327.4

chr19   HAVANA  exon    55162752    55162801    .   +   .   gene_id "ENSG00000267577"; gene_version "1"; transcript_id "ENST00000591665"; transcript_version "1"; gene_type "lncRNA"; gene_name "AC010327.4"; transcript_type "lncRNA"; transcript_name "AC010327.4-201"; exon_number 4; exon_id "ENSE00002866488"; exon_version "1"; level 2; transcript_support_level "2"; tag "basic"; havana_gene "OTTHUMG00000180668.2"; havana_transcript "OTTHUMT00000452499.2";
chr19   HAVANA  exon    55177084    55177540    .   +   .   gene_id "ENSG00000267577"; gene_version "1"; transcript_id "ENST00000591665"; transcript_version "1"; gene_type "lncRNA"; gene_name "AC010327.4"; transcript_type "lncRNA"; transcript_name "AC010327.4-201"; exon_number 5; exon_id "ENSE00002790472"; exon_version "1"; level 2; transcript_support_level "2"; tag "basic"; havana_gene "OTTHUMG00000180668.2"; havana_transcript "OTTHUMT00000452499.2";

And mapping to an exon of the SYT5 gene:

chr19      HAVANA        exon    55171196    55173684           .           -           .   gene_id ENSG00000129990; gene_version 15; transcript_id ENST00000354308; transcript_version 8; gene_type protein_coding; gene_name SYT5; transcript_type protein_coding; transcript_name SYT5-201; exon_number 9; exon_id ENSE00002764071; exon_version 2; level 2; protein_id ENSP00000346265.2; transcript_support_level 1; hgnc_id HGNC:11513; tag basic; tag MANE_Select; tag appris_principal_1; tag CCDS; ccdsid CCDS12919.1; havana_gene OTTHUMG00000180669.4; havana_transcript OTTHUMT00000452501.2;

The STARSolo output for this read: HJVH2DMXX_MayNova1:1:2401:20989:9283 0 chr19 55171443 255 91M * 0 0 TTGGCTACCAGGAAGCTTAGGAGCAAGTAAGTTTGTTTGCTGCCTCCTTTTTTGGTACGTGGGGGTAATTTTTGCCTCATTCTCCACTTAA FFFFFFFFFF:FF:FFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF,F,FFFF UR:Z:ATATGCAACCAC UY:Z:FFFFFFFFFFFF CR:Z:GTGGGAACACATACTG CY:Z:FFFFFFFFFFFFFFFF NH:i:1 GX:Z:- GN:Z:- sF:B:i,2,0 CB:Z:GTGGGAACACATACTG UB:Z:ATATGCAACCAC

As far as I know, the sF:B tag is interpreted as:

image

Does the sF:B tag with a value of 2 is means the read is being interpreted as the antisense exonic version of SYT5, and is thus not counted? If so, I would think that the fact that the assay is stranded would make the read more likely to be emitted by the intronic gene than come from an unannotated antisense gene and thus not be counted. We put a significant amount of weight on the strand matching, to the point that we filter out annotation interpretations that are on the wrong strand.

Thanks for your help interpreting this!

Edit: here's the log for the run with all the parameters. I wonder if this is due to how soloFeatures is set, especially GeneFull Ex50pAS.

STAR mode is assigned
UMI LEN  12
    STAR --soloType Droplet --soloStrand Forward --runThreadN 8 --genomeDir genome_reference --readFilesIn /cromwell_root/fc-d6d96bf4-7662-4cb2-85a6-fcbf92d692b5/submissions/6a59bc60-ca5e-4a5e-abd1-b0bbc2e10bec/Optimus/0069143c-1ef6-44f5-8005-30286f8fa978/call-SplitFastq/glob-ff9a5e75bc9701a883ca32428a133843/fastq_R2_0.fastq.gz /cromwell_root/fc-d6d96bf4-7662-4cb2-85a6-fcbf92d692b5/submissions/6a59bc60-ca5e-4a5e-abd1-b0bbc2e10bec/Optimus/0069143c-1ef6-44f5-8005-30286f8fa978/call-SplitFastq/glob-5b3386a102763e9e23988d2e220cbc9d/fastq_R1_0.fastq.gz --readFilesCommand "gunzip -c" --soloCBwhitelist /cromwell_root/bican-um1-mccarroll_standard/broad/mccarroll/software/metadata/10X/3M-february-2018.txt --soloUMIlen 12 --soloCBlen 16 --soloFeatures Gene --clipAdapterType CellRanger4 --outFilterScoreMin 30 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIdedup 1MM_Directional_UMItools --outSAMtype BAM SortedByCoordinate --outSAMattributes UB UR UY CR CB CY NH GX GN sF --soloBarcodeReadLength 0 --soloCellReadStats Standard
    STAR version: 2.7.11a   compiled: 2023-08-15T11:38:34-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Dec 07 01:43:45 ..... started STAR run
Dec 07 01:44:20 ..... loading genome
Dec 07 01:46:01 ..... started mapping
Dec 07 02:08:56 ..... finished mapping
Dec 07 02:08:59 ..... started Solo counting
Dec 07 02:09:41 ..... finished Solo counting
Dec 07 02:09:41 ..... started sorting BAM
Dec 07 02:19:12 ..... finished successfully
    STAR --soloType Droplet --soloStrand Forward --runThreadN 8 --genomeDir genome_reference --readFilesIn /cromwell_root/fc-d6d96bf4-7662-4cb2-85a6-fcbf92d692b5/submissions/6a59bc60-ca5e-4a5e-abd1-b0bbc2e10bec/Optimus/0069143c-1ef6-44f5-8005-30286f8fa978/call-SplitFastq/glob-ff9a5e75bc9701a883ca32428a133843/fastq_R2_0.fastq.gz /cromwell_root/fc-d6d96bf4-7662-4cb2-85a6-fcbf92d692b5/submissions/6a59bc60-ca5e-4a5e-abd1-b0bbc2e10bec/Optimus/0069143c-1ef6-44f5-8005-30286f8fa978/call-SplitFastq/glob-5b3386a102763e9e23988d2e220cbc9d/fastq_R1_0.fastq.gz --readFilesCommand "gunzip -c" --soloCBwhitelist /cromwell_root/bican-um1-mccarroll_standard/broad/mccarroll/software/metadata/10X/3M-february-2018.txt --soloUMIlen 12 --soloCBlen 16 --soloFeatures GeneFull_Ex50pAS --clipAdapterType CellRanger4 --outFilterScoreMin 30 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIdedup 1MM_Directional_UMItools --outSAMtype BAM SortedByCoordinate --outSAMattributes UB UR UY CR CB CY NH GX GN sF --soloBarcodeReadLength 0 --soloCellReadStats Standard
    STAR version: 2.7.11a   compiled: 2023-08-15T11:38:34-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Dec 07 02:19:17 ..... started STAR run
Dec 07 02:19:20 ..... loading genome
Dec 07 02:38:20 ..... started mapping
Dec 07 02:54:49 ..... finished mapping
Dec 07 02:54:51 ..... started Solo counting
Dec 07 02:55:45 ..... finished Solo counting
Dec 07 02:55:45 ..... started sorting BAM
Dec 07 03:05:09 ..... finished successfully
alexdobin commented 11 months ago

Hi @jamesnemesh

Your interpretation of STARsolo output is correct! It "prioritizes" (and thus discards) exonic antisense over intronic sense alignment. I do not have a strong justification for that, except that I think (based on a very few examples I looked at) that CellRanger does something similar. If you have a CellRanger run on this sample, it would be interesting to see how it treats these reads. Both intronic and antisense reads are mostly artifacts of the library prep, so what to do with them is not very clear.

For this particular example, I think it would be interesting to see the overall RNA signal (sum over all cells) around this locus. If one of the genes clearly has a stronger expression than the other, it could resolve the question where to assign such ambiguous reads.

Cheers Alex

jamesnemesh commented 11 months ago

Hi Alex,

I agree with you that exonic antisense signal is probably noise - I'd prefer if there's truly a gene with antisense expression that the GTF includes that in the model. I'd guess somewhere around 2-3% of UMIs wind up on the "wrong" strand in both DropSeq and 10x data sets, but I've always discarded them.

Intronic reads may be a different story - it's true that there's some priming going on at polyA sites in the genome that contribute to the more recent 10x chemistry being less 3' biased than previously. However, the data sets we're working with now are nuclei, and there's a strong expectation in these data sets that more of the RNA will be immature. It's pretty easy to classify a mixture of whole cells and nuclei both by their %intronic reads as well as seeing expression in the whole cells of genes that should mostly localize to the cytoplasm. This leads me to think the intronic expression is a "real" phenomenon as opposed to noise. That's further backed up some downstream analysis that leverages expression like eQTL where including intronic expression boosts signal - where as if you were simply adding noise you would expect to find fewer significant associations that also replicate in other data sets.

As for this pair of genes - this was definitely something where my particular bias towards favoring on-strand intronic expression over off-strand exonic antisense expression picks up some UMIs. I only have pairwise comparisons of DropSeq, CellRanger V7, and STARSolo. In this case, I'm using the STARSolo processed BAM, and adding my own tags (gn, gs, tf) and my own expression calculations. This should minimize differences due to any other processing other than interpreting the gene models.

image image

If AC010327.4 is being expressed, it's not being expressed a lot, and CellRanger V7 and STARSolo mostly agree - Dropseq is the odd algorithm out. I'm as much interested in trying to understand the rules you're using, and changing at least some of my process to better match the consensus so I can understand if there's still anything significantly different left after I make those changes.

At the total number of UMIs per cell, all 3 methods strongly agree. At the individual gene level, there's still a bit of variation that I'm interested in understanding how the interpretations are different. I've been told if I used CellRanger V6 instead of V7 the results will be much more concordant, which I hope to validate soon.

image image
alexdobin commented 11 months ago

Hi James,

My thinking is that (most) antisense and intronic reads are artifacts. They are not supposed to be present if the cloning worked perfectly, i.e., the correct strand primed at the polyA tail. Even if you work with nuclei and expect a lot of primary RNA, most UTRs are long enough to prevent sequencing of intronic reads. Most intronic reads that we see are mis-priming events, primed off a polyA repeat inside introns. Counting intronic reads creates inter-gene bias: e.g., genes with more polyA repeats will get higher counts. However, this is not entirely noise, as intronic counts for the same gene will correlate with primary transcript abundance, and can be compared between cells.

For your example, my explanation would be as follows. The SYT5 gene has a substantial level of (exonic) expression, which may create a few spurious antisense reads (not real antisense transcription, but wet-lab artifacts). It's less likely that the AC### gene is expressed since the only reads captured are intronic, likely far from the 3' end. We are actually contemplating a more sophisticated model that will take some of these considerations into account quantitatively.

Cheers Alex