Why STAR output highly divergent loci rather than the best hit?

y9c commented 3 years ago

Using the End-to-End mode in the STAR aligner, short reads with mismatch are tend to be aligned into a wrong place on the genome.

Take human genome (GRCh38) as an example, fragment .....ttccaagttcaacagctatggccggcgggccaagagactgcagcc.... is a repeat gene in the genome.

For a certain input read, which contains a single mutation,

@test
AGCTATGGCCGTCGGGCC
+
EEEEEEEEEEEEEEEEEE

I expect STAR do align this reads into the those repeat copies in the genome (1st figure below), and the "nM:i" tag is 1. But STAR align this read into other positions in the genome, and the best hit is carrying 3 mismatch, "nM:i:3" (2nd figure below).

The code for reproducing:

STAR --runThreadN 1 \
     --genomeDir ~/Homo_sapiens.GRCh38.genome \
     --readFilesIn test.fq \
     --alignEndsType EndToEnd \
     --seedSearchStartLmax 12 \
     --seedSplitMin 9 \
     --outFilterMismatchNmax 5 \
     --seedMapMin 3 \
     --outFilterMismatchNoverLmax 0.2 \
     --outFilterMultimapNmax 100 \
     --outSAMmultNmax -1 \
     --outReadsUnmapped Fastx \
     --outSAMattributes NH HI AS nM NM MD jM jI MC \
     --limitBAMsortRAM 1000000000 \
     --outSAMtype SAM \
     --outFileNamePrefix test_star_

Changing --outFilterMultimapNmax and some other arguments can no solve the this problem. Which argument should I modified to make STAR work properly in this situation? Thank you for your help!

alexdobin commented 3 years ago

Hi Chang,

mapping very short reads with mismatches to the whole genome is practically impossible with the current STAR algorithm. STAR relies on exactly matching seeds - such seeds will be very short for short reads with mismatches and will map to too many loci.

Cheers Alex

y9c commented 3 years ago

Thank you @alexdobin ,

Can I adjust the seed length to make short reads alignment works?

alexdobin commented 3 years ago

Hi Chang,

we can make an estimate: for 18b reads, if a mismatch is in the middle, the longest seed will be 9b, which will randomly occur every 4^9 bases, i.e. ~20,000 times in the human genome. This is what makes it impractical - every one of the 20000 loci have to be checked for the correct alignment. I think a better approach would be to mutate each read, which will give you 18*3 reads, and then look for exact matches for each of the mutated reads.

Cheers Alex

y9c commented 3 years ago

Thank you very much for you r explanation. I have another two questions for this assumption. 1) Can STAR generate the "mutation" automatically just in the same way as hisat2-3n. 2) If the read is a slitely longer, or has more than one mutation, the prosible combination will far greater than 18*3. Can I set seed length for this situation?

On Tue, Mar 30, 2021, 11:47 PM Alexander Dobin @.***> wrote:

Hi Chang,

we can make an estimate: for 18b reads, if a mismatch is in the middle, the longest seed will be 9b, which will randomly occur every 4^9 bases, i.e. ~20,000 times in the human genome. This is what makes it impractical - every one of the 20000 loci have to be checked for the correct alignment. I think a better approach would be to mutate each read, which will give you 18*3 reads, and then look for exact matches for each of the mutated reads.

Cheers Alex

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/alexdobin/STAR/issues/1181#issuecomment-810372931, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJKEVTRHXJGXRHVTR5M53LTGHW7PANCNFSM4ZMKEGBQ .

alexdobin commented 3 years ago

Hi Chang,

1) STAR cannot do it presently, but it looks like a good feature to have, though not easy to implement. 2) Effectively, the seed length is controlled by two parameters: --seedSearchStartLmax deafult=50, you would need to reduce to the seed size, i.e. readLen/2 in your case. and --winAnchorMultimapNmax which determines the max number of loci the seed can map to. As you increase it, you will also need to increase

--alignTranscriptsPerReadNmax               default=10000
    int>0: max number of different alignments per read to consider
--alignTranscriptsPerWindowNmax          default=100
    int>0: max number of transcripts per window
--alignTranscriptsPerReadNmax              deafult=10000
    int>0: max number of different alignments per read to consider

However, if you want to guarantee to find all alignments with one mismatch, the all-mutations approach is the best even for larger read lengths.

Cheers Alex

alexdobin / STAR

Why STAR output highly divergent loci rather than the best hit? #1181