eturro / mmseq

Haplotype, isoform and gene level expression analysis using multi-mapping RNA-seq reads
GNU General Public License v2.0
67 stars 20 forks source link

mmseq STAR aligner #3

Open jeffhsu3 opened 9 years ago

jeffhsu3 commented 9 years ago

mmseq appears to work with STAR aligner output. Do you know of any issues that might arise from using STAR with mmseq?

eturro commented 9 years ago

The aligner must provide all of the best alignments of reads or read pairs to transcripts (not to the genome). By "best" I mean all possible alignments with a minimal number of mismatches. Bowtie1 can do this with the options in the MMSEQ documentation but Bowtie 2 cannot to my knowledge. Does STAR have such an option? If so it would be great if you could send me the details so that I can test this approach and, if it replicates the bowtie1 results (approximately), add it as an aligner option to the MMSEQ documentation. Thank you.

jeffhsu3 commented 9 years ago

Star does have such an option, I haven't run all of mmseq using bowtie1 yet (lots of fastq files for 1 sample on different lanes), but using the option '--outFilterMultimapNmax', '100'' produces similar output with that of bowtie 1. I believe that multimapping is the default option in STAR.

The flags are different however and the optional tags that both aligners output are different. STAR includes a 'NH:i:7' tag denoting how many locations that the read maps to.

jeffhsu3 commented 9 years ago

I should also note that I ran the aligned files from STAR with bam2hits and mmseq and they seem to produce reasonable numbers.

jeffhsu3 commented 9 years ago

The correlation between the log_mu generated by mmseq from STAR and bowtie alignments was 0.89. STAR had aligned 12% more read fragments than bowtie using the settings described in the README.

eturro commented 9 years ago

Thanks for letting me know. I'm still looking into this. I would note though that it is important that the aligner does not try too hard find alignments to the transcriptome because the fragments may have come from non-coding regions and therefore should rightly be left unmapped. In a previous assessment I've been involved with, we aligned to the genome with STAR and looked at how many of the aligned read pairs were transcriptome compatible. We did not find much difference compared to the number of bowtie1 alignments to the transcriptome, suggesting that bowtie1 aligns roughly the "right" number of read pairs, even when non-coding regions are excluded from the reference. So "more" alignments is not always "better".

jeffhsu3 commented 9 years ago

I definitely agree, in no way was I saying that STAR was better just, had more, I'm currently pulling all the fragments that differ between the two alignments to determine what is going on. Do you mean non-transcribed regions rather than non-coding regions?

eturro commented 9 years ago

Slip of the tongue; yes, I meant non-transcribed (post splicing) according to the source used to build the transcriptome reference. In typical experiments, upwards of 30% of read pairs appear to emanate from such regions (e.g. intronic or intergenic regions) so it is important not to misalign these (probably with plenty of mismatches) to the transcriptome because the aligner tries too hard.

eturro commented 9 years ago

STAR is a splice-aware aligner and that is not functionality you want switched on when aligning to the transcriptome; could this be why you got so many more alignments and do you know if it can be disabled?

jeffhsu3 commented 9 years ago

There is an argument for splice junctions (--sjdb), I'll try passing it in nothing.

On Tue, Apr 21, 2015 at 10:38 AM, Ernest Turro notifications@github.com wrote:

STAR is a splice-aware aligner and that is not functionality you want switched on when aligning to the transcriptome; could this be why you got so many more alignments and do you know if it can be disabled?

— Reply to this email directly or view it on GitHub https://github.com/eturro/mmseq/issues/3#issuecomment-94817845.