alexdobin / STAR

RNA-seq aligner
MIT License
1.83k stars 502 forks source link

Mapped reads seemingly not reported in .Aligned.toTranscriptome.out.bam #735

Open climouse opened 5 years ago

climouse commented 5 years ago

Hi, I am running into a weird behavior when running STAR with the option --quantMode TranscriptomeSAM to obtain a bam file in transcriptome space.

My transcriptome is somewhat non-standard as I want to consider the set of gene bodies (as defined by the features of type "gene" in the gencode human transcriptome) as transcriptome rather than spliced genes. Therefore, during index creation, I used the following parameters --sjdbGTFfeatureExon gene --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_id

I suppose this should effectively bypass all the info on the exons-introns structure in the gff file, and result in an empty database of annotated splice junctions. Thus, my expectation was that any read that maps to an intron of a gene (say ENSGxxx) would be reported in at least one bam record of .Aligned.toTranscriptome.out.bam (with scaffold name ENSGxxx).

This seems to be the case most of the time, but some mapped reads that clearly align to an intron based on .Aligned.out.bam, do not have any corresponding alignment in .Aligned.toTranscriptome.out.bam.

The example below illustrates the situation. I am sure there is something I don't quite understand with respect to the parameters or expected behavior, but any help would be very much appreciated!

cat test.fastq
@read1
CCACCCACGCACACACACACACACACACACACACACACACACACACACACAATGGAGAGAAAACACACACAC
+
CCDCCDCCCCCCIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
GAGCAGACACTAAACAGGCTGCTGGTGGTGGTTCTCTGAGGAGGAATAAACCGGTTAGAGAGGCCAAGTCTGT
+
CCCDDFFFFFFFGGGGGGGGGGHHHGHGFGGGGHHHHHHHGHGGGHHHHHHHGEEFGHHHHGGGGGHHDGHHH
@read3
ATGATCACGAAGGTGGTTTTCCCAGGGCGAGGCTGAGGCCTGAGAATCAGTTGAGGCCAGGAGTTTGAGACCACCTTGGGCAACATAGTGAGATCCCAGCTCTACAAAAAATT
+
CCCCIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

The three reads align to a locus within ENSG00000160087.20 (UBE2J2 protein_coding) and each yields a single record in .Aligned.out.bam. Reads 1 and 2 have a corresponding record in .Aligned.toTranscriptome.out.bam, but read3 does not.

samtools view star.Aligned.out.bam | cut -f 1-8
read1   16      chr1    1265583 255     68M4S   *       0
read2   16      chr1    1270305 255     73M     *       0
read3   16      chr1    1268852 255     2S83M28S        *       0
samtools view star.Aligned.toTranscriptome.out.bam | cut -f 1-8
read1   0       ENSG00000160087.20      8232    255     72M     *       0
read2   0       ENSG00000160087.20      3509    255     73M     *       0

I am using the following command:

STAR --runThreadN 6 --readFilesIn test.fastq --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outFileNamePrefix star. --genomeDir STARgenomeV29gff3_GENEBODIES --outFilterMultimapNmax 10 --outSAMmultNmax 10 --outSAMattributes All --outReadsUnmapped None --outSAMunmapped Within --outMultimapperOrder Random --genomeLoad LoadAndKeep --sjdbGTFfeatureExon gene --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_id

The index was generated using

STAR   --runMode genomeGenerate   --runThreadN 12   --genomeDir STARgenomeV29gff3_GENEBODIES   --genomeFastaFiles GCA_000001405.15_GRCh38_no_alt_analysis_set.fna      --sjdbGTFfile gencode.v29.primary_assembly.annotation.gff3   --sjdbGTFfeatureExon gene   --sjdbGTFtagExonParentTranscript gene_id   --sjdbOverhang 150

For the gene in the example above, the indexing resulted in

cat STARgenomeV29gff3_GENEBODIES/transcriptInfo.tab | grep ENSG00000160087.20
ENSG00000160087.20      1253908 1273884 1251333 2       1       89
alexdobin commented 5 years ago

Hi @climouse

to output soft-clipped to the Transcriptome BAM, you would need to use --quanTranscriptomeBam Singleend. By default the soft-clipped are not output, this was done so that the default value works with RSEM. I will likely change this in the future.

Also, please use the latest release 2.7.2a, as the processing of GTF had problems in the earlier versions if you specified the same tag for --sjdbGTFtagExonParentTranscript gene_id --sjdbGTFtagExonParentGene gene_id

Cheers Alex

climouse commented 5 years ago

Hi Alex,

Thanks for your response and for continuously updating this fantastic aligner. This is really helpful. With the options you told me I can now seen all the reads I was initially expecting in the Aligned.toTranscriptome.out.bam file.

I would also like to suggest a couple of possible new features related to better connecting the two bams file (genome and transcriptome space).

1) it would great if the records in out.Aligned.toTranscriptome.bam contained an extra field to indicate which of the genomic alignment they came from. For example if a read aligns to two genomic loci and shows up as two records in Aligned.out.bam (with HI:i:1 for the first alignment and HI:i:2 for the second), there could be in Aligned.toTranscriptome.out.bam something like HG:i:<HI> where <HI> is equal to the HI value of the corresponding genomic record.

2) it is probably a little more of a sidetrack, but it would be really awesome if the records in genomic space also readily contained a “compact” information about the alignments to the transcriptome. What I have in mind more specifically is incorporating the concept of equivalence classes, as used in Salmon or Kallisto. This could be implemented for example by adding a new field with an integer value pointing to the index of an equivalence class (the set of all the scaffolds in the transcriptome a given read can align to, by which I mean read not just bam record). I have written a script which takes the two files Aligned.toTranscriptome.out.bam and Aligned.out.bam and performs such an operation but it would certainly be a lot more efficient to generate the equivalence class annotation on the fly. I was also wondering if other people might be interested in this feature!

Cheers,

Charles

alexdobin commented 4 years ago

Hi Charles,

very good suggestions, adding to my list!

Thanks Alex

climouse commented 4 years ago

Hi Alex,

Fantastic, thank you!

Cheers, Charles