alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Mapping difference with one genome and this same genome included with others #1359

Open MinosBiosciences opened 3 years ago

MinosBiosciences commented 3 years ago

Hi Alex,

I have a 10X V3 dataset composed of a mix of cells from Homo sapiens, Mus musculus and Canis familiaris. I created a fake merged genome from Ensembl files (v101) of the 3 organisms where the chr names are not "1-2-3-..." but "Hs1-Hs2-Hs3-..." for Homo sapiens, "Mm1-Mm2-Mm3-..." for Mus musculus and "Cf1-Cf2-Cf3-..." for Canis familiaris.

By the end, I'm keeping only some BCs that correspond to Human cells. But once I retrieve the reads related to these BCs from the original dataset and map them to Homo sapiens genome only, some of them are unmapped (though they mapped to the fake merged genome with the same arguments).

I came up to isolate one single read in this situation. What happens is that, with the Homo sapiens, it maps to 2 loci (chr1 and chr4) whereas, on the merged genome, it was only mapping to chr1 (I wasn't allowing multimappers, that's why it was unmapped in Homo sapiens only genome). The exact same chr4 is however included in the fake merged genome. When I map this read to a genome composed only of the chr4, it is considered "too short".

Any idea how this is possible? Could it be something related to the "--genomeSAindexNbases" when I create the genome index? (I reduced it for the chr4 index but didn't increased it for the fake merged genome that is obviously greater than Homo sapiens genome)

Cheers, Mathieu

alexdobin commented 3 years ago

Hi Mathieu,

here is one explanation for this observation. The chr4 alignment depends on the seed that maps to too many loci in the 3-species genomes, and thus it's not found. In the human genome, this seed maps to a small enough number of loci, and so alignment is generated.

This is controlled by --winAnchorMultimapNmax = 50 by default, for the 3-species genome you can try to use a larger value (say, 150) to see if you can recover the chr4 alignment.

Cheers Alex

MinosBiosciences commented 3 years ago

Hi Alex,

Thanks for the quick answer! :)

I'm not sure I fully understand. What you call the seed is a kmer that would be found more than 50 times in the 3-species genome? In this case why would the chr1 alignment preferred? I probably didn't get it and I actually don't know exactly what STAR does! I tried to upgrade the value of the parameter you mentioned up to 1500 but it didn't change the result (still a single alignment to chr1).

I can provide the commands (and THE read) to reproduce the situation if necessary.

Cheers, Mathieu

alexdobin commented 3 years ago

Hi Mathieu,

seed is a subsequence of a read matching the genome sequence exactly. I think in your example the sequences of the alignment loci in chr1 and chr4 are different, so the seeds are different. The chr1 seed maps <50 times and chr4 seed maps >50 times in the 3-species genome, but both are <50 in the human genome.

I do not know if this is an explanation, but, in general, there is no guarantee that all alignments to the 3-species genome will be recovered in the 1-species genome and vice versa. However, the number of cases should be a small % of all reads. The alignments to the correct species are more trustworthy than to the combined species genome.

Cheers Alex

MinosBiosciences commented 3 years ago

Hi Alex,

Thanks for your answer, I now have a better understanding of how STAR works. Since the goal here was to have a perfect match, we'll probably change our strategy.

Cheers, Mathieu