Fewer reads mapping to Transcriptome with STAR 2.7.9a when using augmented genome annotation

tatyana-perlova commented 3 years ago

When we use STAR to map the results of targeted sequencing we merge the gencode annotation with specific amplicons we use for targeted amplification. (We design primers close to polyA site in the 3-prime UTR region for genes of interest so we want to make sure that resulting amplicons lie within the annotated region). So the exonic annotation for a specific gene might look something like this:

chrM    ENSEMBL exon    8527    9207    .   +   .   gene_id "ENSG00000198899.2"; transcript_id "ENST00000361899.2"; gene_type "protein_coding"; gene_name "MT-ATP6"; transcript_type "protein_coding"; transcript_name "MT-ATP6-201"; exon_number 1; exon_id "ENSE00001727012.2"; level 3; protein_id "ENSP00000354632.2"; transcript_support_level "NA"; hgnc_id "HGNC:7414"; tag "basic"; tag "appris_principal_1";
chrM    custom_panel    exon    8983    9002    .   +   .   gene_id "ENSG00000198899.2"; transcript_id "ATP6_custom_panel_ATP6_1"; gene_name "MT-ATP6"; gene_type "protein_coding";

As you can see our amplicon (custom_panel) in this case overlaps with the existing annotation, it inherits the gene_name and gene_id, but has a different transcript_id.

This .gtf is then used to create STAR reference:

/opt/STAR-2.7.6a/bin/Linux_x86_64/STAR
   --runMode genomeGenerate \
   --runThreadN 16 \ 
  --genomeDir STAR_reference/STAR_2.7.6a \
  --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
  --sjdbGTFfile amplicons_reference.gtf  \

The targeted library is mapped using the following parameters:

##### Final user re-defined parameters-----------------:
runThreadN                        14
genomeDir                         STAR_reference/STAR_2.7.6a
readFilesIn                       cartridge1_R2.fastq.gz  cartridge1_R1.fastq.gz   
readFilesCommand                  zcat   
outFileNamePrefix                 results
outStd                            Log
outReadsUnmapped                  Fastx
outSAMtype                        BAM   Unsorted   
outSAMattributes                  NH   HI   nM   AS   CR   UR   GX   GN   sS   sQ   sM   MD   
quantMode                         GeneCounts   
soloType                          CB_UMI_Complex
soloCBwhitelist                   /BD_rhapsody/barcodes_whitelist/barcodes_1.txt   /BD_rhapsody/barcodes_whitelist/barcodes_2.txt   /BD_rhapsody/barcodes_whitelist/barcodes_3.txt   
soloFeatures                      Gene   
soloUMIdedup                      1MM_Directional   
soloAdapterSequence               GGTAGCGGTGACA
soloAdapterMismatchesNmax         3
soloCBmatchWLtype                 1MM
soloCBposition                    0_0_0_8   2_-9_2_-1   3_1_3_9   
soloUMIposition                   3_10_3_17
soloUMIfiltering                  MultiGeneUMI

Now the problem is that with the new STAR 2.7.9a we have much lower fraction of reads mapping to transcriptome when using merged annotation than we do when using default gencode annotation. All other QC metrics, like fraction of reads mapped to Genome and fraction of reads with valid Barcodes closely match. And the results for STAR 2.7.9a with default reference virtually match STAR 2.7.6a with merged reference, which is what we've been using before.

To be more specific, for gene MT-ATP6 from the above .gtf example I see high coverage for:

STAR 2.7.6a with merged annotation, reads aligned to transcriptome
STAR 2.7.9a default annotation, reads aligned to transcriptome
STAR 2.7.9a merged annotation, reads aligned to genome and basically no coverage for
- STAR 2.7.9a merged annotation, reads aligned to transcriptome

Was there any change in behavior in the new release that could cause such an effect?

Thank you very much!

alexdobin commented 3 years ago

Hi Tatyana,

I cannot think of any change that would have results in such behavior. when you say "coverage", what output do you have in mind: ReadsPerGene or Solo matrix?

Cheers Alex

tatyana-perlova commented 2 years ago

Hi Alex,

Thank you for your reply and for your continuing support and development of this great tool!

I actually used reads Aligned.out.bam and Aligned.toTranscriptome.out.bam loaded in IGV along with the genome annotation to look at the gene coverage.
We found a bug in how we were generating the merged .gtfs: instead of adding the full amplicons from inner primer to polyA site, we only added the primer so that this additional 20 nt transcript only partially overlapped the actual reads that are 138 nt long. Now that we fixed that bug we don't see the problem anymore.
I still don't understand why it was happening though. I would assume that the reads would simply map to other transcripts corresponding to the same gene.

Let me know if you would like me to sent you minimal input files to reproduce the problem.

Thanks again!

Tatyana

alexdobin commented 2 years ago

Hi Tatyana:

I am not sure I understand how the GTF looked like when you were getting low coverage of the genes. I guess an example will be helpful. Aligned.out.bam should not be strongly affected by GTF contents - only some spliced alignments may be affected. Aligned.toTranscriptome.out.bam maybe affected more if there are issues with transcript labeling.

Cheers Alex

alexdobin / STAR

Fewer reads mapping to Transcriptome with STAR 2.7.9a when using augmented genome annotation #1398