Open y9c opened 3 years ago
Hi Chang,
mapping very short reads with mismatches to the whole genome is practically impossible with the current STAR algorithm. STAR relies on exactly matching seeds - such seeds will be very short for short reads with mismatches and will map to too many loci.
Cheers Alex
Thank you @alexdobin ,
Can I adjust the seed length to make short reads alignment works?
Hi Chang,
we can make an estimate: for 18b reads, if a mismatch is in the middle, the longest seed will be 9b, which will randomly occur every 4^9 bases, i.e. ~20,000 times in the human genome. This is what makes it impractical - every one of the 20000 loci have to be checked for the correct alignment. I think a better approach would be to mutate each read, which will give you 18*3 reads, and then look for exact matches for each of the mutated reads.
Cheers Alex
Thank you very much for you r explanation. I have another two questions for this assumption. 1) Can STAR generate the "mutation" automatically just in the same way as hisat2-3n. 2) If the read is a slitely longer, or has more than one mutation, the prosible combination will far greater than 18*3. Can I set seed length for this situation?
On Tue, Mar 30, 2021, 11:47 PM Alexander Dobin @.***> wrote:
Hi Chang,
we can make an estimate: for 18b reads, if a mismatch is in the middle, the longest seed will be 9b, which will randomly occur every 4^9 bases, i.e. ~20,000 times in the human genome. This is what makes it impractical - every one of the 20000 loci have to be checked for the correct alignment. I think a better approach would be to mutate each read, which will give you 18*3 reads, and then look for exact matches for each of the mutated reads.
Cheers Alex
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/alexdobin/STAR/issues/1181#issuecomment-810372931, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJKEVTRHXJGXRHVTR5M53LTGHW7PANCNFSM4ZMKEGBQ .
Hi Chang,
1) STAR cannot do it presently, but it looks like a good feature to have, though not easy to implement.
2) Effectively, the seed length is controlled by two parameters:
--seedSearchStartLmax deafult=50
, you would need to reduce to the seed size, i.e. readLen/2 in your case.
and --winAnchorMultimapNmax
which determines the max number of loci the seed can map to. As you increase it, you will also need to increase
--alignTranscriptsPerReadNmax default=10000
int>0: max number of different alignments per read to consider
--alignTranscriptsPerWindowNmax default=100
int>0: max number of transcripts per window
--alignTranscriptsPerReadNmax deafult=10000
int>0: max number of different alignments per read to consider
However, if you want to guarantee to find all alignments with one mismatch, the all-mutations approach is the best even for larger read lengths.
Cheers Alex
Using the End-to-End mode in the STAR aligner, short reads with mismatch are tend to be aligned into a wrong place on the genome.
Take human genome (GRCh38) as an example, fragment
.....ttccaagttcaacagctatggccggcgggccaagagactgcagcc....
is a repeat gene in the genome.For a certain input read, which contains a single mutation,
I expect STAR do align this reads into the those repeat copies in the genome (1st figure below), and the "nM:i" tag is 1. But STAR align this read into other positions in the genome, and the best hit is carrying 3 mismatch, "nM:i:3" (2nd figure below).
The code for reproducing:
Changing
--outFilterMultimapNmax
and some other arguments can no solve the this problem. Which argument should I modified to make STAR work properly in this situation? Thank you for your help!