alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

Fewer reads mapping to Transcriptome with STAR 2.7.9a when using augmented genome annotation #1398

Open tatyana-perlova opened 3 years ago

tatyana-perlova commented 3 years ago

When we use STAR to map the results of targeted sequencing we merge the gencode annotation with specific amplicons we use for targeted amplification. (We design primers close to polyA site in the 3-prime UTR region for genes of interest so we want to make sure that resulting amplicons lie within the annotated region). So the exonic annotation for a specific gene might look something like this:

chrM    ENSEMBL exon    8527    9207    .   +   .   gene_id "ENSG00000198899.2"; transcript_id "ENST00000361899.2"; gene_type "protein_coding"; gene_name "MT-ATP6"; transcript_type "protein_coding"; transcript_name "MT-ATP6-201"; exon_number 1; exon_id "ENSE00001727012.2"; level 3; protein_id "ENSP00000354632.2"; transcript_support_level "NA"; hgnc_id "HGNC:7414"; tag "basic"; tag "appris_principal_1";
chrM    custom_panel    exon    8983    9002    .   +   .   gene_id "ENSG00000198899.2"; transcript_id "ATP6_custom_panel_ATP6_1"; gene_name "MT-ATP6"; gene_type "protein_coding";

As you can see our amplicon (custom_panel) in this case overlaps with the existing annotation, it inherits the gene_name and gene_id, but has a different transcript_id.

This .gtf is then used to create STAR reference:

/opt/STAR-2.7.6a/bin/Linux_x86_64/STAR
   --runMode genomeGenerate \
   --runThreadN 16 \ 
  --genomeDir STAR_reference/STAR_2.7.6a \
  --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
  --sjdbGTFfile amplicons_reference.gtf  \

The targeted library is mapped using the following parameters:

##### Final user re-defined parameters-----------------:
runThreadN                        14
genomeDir                         STAR_reference/STAR_2.7.6a
readFilesIn                       cartridge1_R2.fastq.gz  cartridge1_R1.fastq.gz   
readFilesCommand                  zcat   
outFileNamePrefix                 results
outStd                            Log
outReadsUnmapped                  Fastx
outSAMtype                        BAM   Unsorted   
outSAMattributes                  NH   HI   nM   AS   CR   UR   GX   GN   sS   sQ   sM   MD   
quantMode                         GeneCounts   
soloType                          CB_UMI_Complex
soloCBwhitelist                   /BD_rhapsody/barcodes_whitelist/barcodes_1.txt   /BD_rhapsody/barcodes_whitelist/barcodes_2.txt   /BD_rhapsody/barcodes_whitelist/barcodes_3.txt   
soloFeatures                      Gene   
soloUMIdedup                      1MM_Directional   
soloAdapterSequence               GGTAGCGGTGACA
soloAdapterMismatchesNmax         3
soloCBmatchWLtype                 1MM
soloCBposition                    0_0_0_8   2_-9_2_-1   3_1_3_9   
soloUMIposition                   3_10_3_17
soloUMIfiltering                  MultiGeneUMI   

Now the problem is that with the new STAR 2.7.9a we have much lower fraction of reads mapping to transcriptome when using merged annotation than we do when using default gencode annotation. All other QC metrics, like fraction of reads mapped to Genome and fraction of reads with valid Barcodes closely match. And the results for STAR 2.7.9a with default reference virtually match STAR 2.7.6a with merged reference, which is what we've been using before.

To be more specific, for gene MT-ATP6 from the above .gtf example I see high coverage for:

Was there any change in behavior in the new release that could cause such an effect?

Thank you very much!

alexdobin commented 3 years ago

Hi Tatyana,

I cannot think of any change that would have results in such behavior. when you say "coverage", what output do you have in mind: ReadsPerGene or Solo matrix?

Cheers Alex

tatyana-perlova commented 2 years ago

Hi Alex,

Thank you for your reply and for your continuing support and development of this great tool!

Let me know if you would like me to sent you minimal input files to reproduce the problem.

Thanks again!

Tatyana

alexdobin commented 2 years ago

Hi Tatyana:

I am not sure I understand how the GTF looked like when you were getting low coverage of the genes. I guess an example will be helpful. Aligned.out.bam should not be strongly affected by GTF contents - only some spliced alignments may be affected. Aligned.toTranscriptome.out.bam maybe affected more if there are issues with transcript labeling.

Cheers Alex