Splice junction reads are possibly ignored when custom pre-mRNA reference is used

gokceneraslan commented 4 years ago

This page on the support website describes how to prepare a pre-mRNA reference package by

First, creating a GTF file where all transcript annotations are replaced with exons and everything else is removed (resulting in a 79MB file whereas original GTF is 1.3GB):

awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print}' \
     refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf > GRCh38-1.2.0.premrna.gtf

Second, using this GTF to prepare the "reference package" where STAR index is also created with this GTF file without proper exon information (since real exon info is removed in the previous step):

cellranger mkref --genome=GRCh38-1.2.0_premrna \
                 --fasta=refdata-cellranger-GRCh38-1.2.0/fasta/genome.fa \
                 --genes=GRCh38-1.2.0.premrna.gtf

Here is how the resulting GTF file looks like in IGV compared to unmodified ENSEMBL v94:

Although the procedure is quite simple, I think one caveat of doing this is that the resulting STAR index doesn't have any information about the real exons and consequently, it doesn't know anything about exon-exon junctions.

This can be shown by simply comparing the STAR's splice junction db files namely sjdb* files (or actually comparing the sizes of files as a lazy comparison) in the star folder of the reference packages:

star folder in the standard refdata-cellranger-GRCh38-1.2.0:

1.2K chrLength.txt
3.1K chrNameLength.txt
1.9K chrName.txt
2.1K chrStart.txt
 38M exonGeTrInfo.tab
 16M exonInfo.tab
527K geneInfo.tab
3.0G Genome
 909 genomeParameters.txt
8.1G SA
1.5G SAindex
9.2M sjdbInfo.txt
7.2M sjdbList.fromGTF.out.tab
7.2M sjdbList.out.tab
9.5M transcriptInfo.tab

star folder in the custom pre-mRNA refdata-cellranger-GRCh38-1.2.0, created using the instructions above:

1.2K chrLength.txt
3.1K chrNameLength.txt
1.9K chrName.txt
2.1K chrStart.txt
5.7M exonGeTrInfo.tab
1.6M exonInfo.tab
527K geneInfo.tab
3.0G Genome
 878 genomeParameters.txt
7.6G SA
1.5G SAindex
   6 sjdbInfo.txt
   0 sjdbList.fromGTF.out.tab
   0 sjdbList.out.tab
9.3M transcriptInfo.tab

It's easy to spot that there is nothing in the sjdb* files in the modified index.

As a solution, we can simply copy transcript annotations in the GTF as extra exons and leave the rest of the annotations in the file instead of removing them. This should both provide all the required exon info to STAR and also cover introns with the extra added exon which makes cellranger count count intronic reads.

This can be achieved with the following command:

awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ print; $3="exon"; print; next}{print}'  \
     refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf > GRCh38-1.2.0.premrna.gtf

diff between the original file and the modified file is given below:

I haven't checked if the counts in the count matrix produced with the proposed index are slightly higher, but I just wanted to ask about your opinion. What do you think?

gokceneraslan commented 4 years ago

Actually, it'd be great if @alexdobin or @brianjohnhaas could comment on that. Can STAR still map the junction reads if we use the GTF file where we only have transcript annotations that are renamed to exons?

evolvedmicrobe commented 4 years ago

Hi @gokceneraslan, you are indeed correct. The current pre-mRNA reference technique does not annotate exon-exon boundaries, but by annotating the current transcripts and then adding in the introns, one can improve the mapping (I've seen up to 8%ish). Space Ranger introduced a new option to mkref called --nuclei which does this automatically, (https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/advanced/references) and the next version of Cell Ranger will also implement this improvement.

If it'd be useful, I can try and send you a patch file that adds the nuclei option into the current mkref for Cell Ranger.

gokceneraslan commented 4 years ago

Hi @evolvedmicrobe (nice username :) )

Thanks for the reply. Couple of questions.

Do you think it'd make sense to update the instruction on the support page so that people use a better reference with exons? I don't know when the next version is coming up, but updating the instructions would help people who don't/won't use the cellranger version which provides the --nuclei option. Especially for older datasets, people tend to use older versions. They may not want to switch to new cellranger, make index with nuclei option, switch back to old cellranger and run cellranger count etc
I kind of understand the technical reasons as to why the --nuclei option is added to the cellranger mkref command, but I think it makes more sense to add a --count-gtf-feature option which is exon by default to the cellranger count command where an alternative value would be gene, which counts all reads within the gene boundaries. This would be easier for many people because it doesn't need building a new reference.
When the --nuclei option is used, is the STAR index also built with the modified GTF? I think it'd also be great to be able to use separate GTFs for read counting and STAR index building. Because it feels weird to mess with STAR's index for something we need at a later stage (= read counting).

Cheers.

evolvedmicrobe commented 4 years ago

This would be easier for many people because it doesn't need building a new reference.

Definitely. So, we've explored and enabled modes in Cell Ranger that basically allow you to count UMIs from reads that only uniquely map somewhere in a gene, and not necessarily on the exons, without having to make a whole new reference genome. We'd have to think about what to call that option, and would likely avoid using some of the current language associated with it, e.g. nuclei or pre-mRNA. I think the options a bit tricky to name, but we'll definitely take count-gene-feature under consideration.

When the --nuclei option is used, is the STAR index also built with the modified GTF?

Yes, and in fact you get slightly different results because of this relative to you modifying the read counting, because the different reference leads to slightly different alignments.

Do you think it'd make sense to update the instruction on the support page so that people use the correct reference?

I'll run that by our support group.

Especially for older datasets, people tend to use older versions

Hmmm.... that's a shame. Would be curious if you had any more insight into why people don't upgrade. As the software evolves we try to maintain backwards compatibility while implementing new features, so for example the next release of Cell Ranger is going to be a lot faster than the current version, and for that reason alone if I was an end user I would make the switch, but there's also pretty steady improvements in other areas that will often benefit older data.

alexdobin commented 4 years ago

Hi Gökçen, Nigel,

interesting discussion, let me add my 2 cents.

STAR alignments indeed depend on the annotations used at the genome generation step. For mature mRNA annotations, STAR uses the splice junctions to improve the spliced alignments. The "pre-mRNA" annotations do not contain splice junctions, and so the alignments are somewhat worse.

Using mRNA GTF for genome generation and pre-mRNA GTF for counting, as Gökçen suggested, would be indeed the best approach - you get the best alignments, and have control over counting. This is how STARsolo does counting for nuclei (GeneFull option).

Cheers Alex

gokceneraslan commented 4 years ago

When I compared two pre-mRNA references (instructions on the 10X webpage and my suggestion), I got pretty confused.

Here, the upper panel shows the 10X pre-mRNA reference and the panel below shows the one built with my suggestion:

It's interesting to see that STAR can still align some junction reads to the transcriptome. (Maybe split-reads with short overhangs?)
Reads mapped to the 3'UTR of this gene (AGER) using my reference seems around 7 times higher than the 10X reference.
When I look at the counts of filtered_feature_bc_matrix.h5 file in both runs, I see the opposite. The count matrix built using the 10X reference has 4 times higher UMIs for this gene compared to my reference. I used the latest cellranger (3.1) for this.

Could it be that the cellranger counting code gets confused because of the additional exons that I add? I wasn't expected this result at all.

evolvedmicrobe commented 4 years ago

It's interesting to see that STAR can still align some junction reads to the transcriptome. (Maybe split-reads with short overhangs?)

That's broadly consistent with STAR's approach, it can align across novel junctions, but it's even better at aligning across known junctions.

When I look at the counts of filtered_feature_bc_matrix.h5 file in both runs, I see the opposite. The count matrix built using the 10X reference has 4 times higher UMIs for this gene compared to my reference. I used the latest cellranger (3.1) for this.

I think it'd be easiest and most useful to chat about these issues, just sent an invite to connect via LinkedIn (or at least I hope it was sent to you!). Let me know if you have time for a zoom.

johnchamberlin commented 3 years ago

This is an interesting discussion. However, CellRanger QC report includes this documentation: "Reads Mapped Confidently to Transcriptome: Fraction of reads that mapped to a unique gene in the transcriptome. The read must be consistent with annotated splice junctions. These reads are considered for UMI counting."

This suggests that CellRanger failing to align spliced reads with the "pre-mRNA" workflow is irrelevant because it can only count these reads if splice junctions are provided. Is this correct?

Thanks, John Chamberlin

10XGenomics / cellranger

Splice junction reads are possibly ignored when custom pre-mRNA reference is used #76