alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

STAR behavior with combined genomes #748

Open aminzia opened 5 years ago

aminzia commented 5 years ago

Hi Alex, I am testing STAR to detect EBV RNAseq reads mixed with human reads.

In one setting, I indexed a combined genome and aligned the reads to the combined genome. This method did not give any hits for the virus genome. In second, I mapped those reads that were not mapped to the human genome (aligned to human genome separately) to the EBV genome.

The second method gave some non-negligible number of hits. The read length are relatively short (50bp).

I was wondering if you could give a clue about the way you implemented the alignment that could make sense of this behavior. I imagined maybe there's some competition between the human and EBV genome when the index is combined that causes all reads to be aligned to the human. But then, if that's the case, those reads should not remain unmapped in 2nd approach to be mapped to EBV.

So, I am not sure why this happens.

Thanks Amin

alexdobin commented 5 years ago

Hi Amin,

Combined genome is usually superior to consequent mapping.

Couple of things to check first:

  1. Have you checked the multimapping reads that map to both EBV and the human genome in the 1st run (combined genome)?

  2. Do you see exactly matching reads in the 2nd run (separate genomes)? If they all have mismatches, they might be sub-par to alignments in the human genome and thus not reported.

  3. Can you extract a 50b sequence from the EBV and try to map to the combined genome?

Cheers Alex

aminzia commented 5 years ago

Thanks Alex. I will look into these points. But by checking the alignments in 2nd approach, I see some alignments that seem unusual. Look at this example that STAR has aligned to the ebv genome, specifically the CIGAR variable. There's two huge skips (N) with 3 relatively short stretches of match. This has caused a huge template length (89K). Is this a usual?

HTTGMBGXB190830... 163 HVV4 47031 255 7M43444N7M46018N23M1S = 47031 89499 GCTTGATACCACTGCTTGATACCACTGCTTGATACCAC AAAAAEEEEEEEEEEEEEEEEEEEEEAEEA</EEEAEE NH:i:1 HI:i:1 AS:i:5 0 nM:i:10

alexdobin commented 5 years ago

Hi Amin,

by deafult, STAR looks for the best spliced alignments, so on a small genome it can split the read into many small blocks all stitched together. To prohobit splicing, use --alignIntronMax 1 . If most reads mapping to EBV genome look like this, it likely means they do not belong there. And when you map to the combined genome, the alignments to the human genome are always better (or equal, if you see human-EBV multimappers).

Cheers Alex