feature request: STAR polyploid?

mparker2 commented 10 months ago

Hi @alexdobin

Thanks again for writing & maintaining STAR. The latest diploid mapping feature is very cool. I was wondering how difficult you think it would be to extend it even further, to map to more than two haplotypes at the same time. There are several examples that I can think of that could make use of this:

- Mapping bulk RNA-seq reads from polyploid species
- Mapping single-cell RNA-seq reads from complex mixtures of genotypes, i.e. pools of cells from multiple individuals

The latter case is what I am currently interested in - my workaround is to map to the reference genome first with STARsolo, use SNPs to assign and demultiplex the cell barcodes from different genotypes, and then remap each genotype separately with STAR-diploid + STARsolo. The downside to this approach is that I end up with separate STARsolo output for the different genotypes & I am unsure if they are comparable given they have been processed differently from each other. It would be ideal (and also more sensitive for the genotyping step) to align to all haplotypes from the very start of the analysis.
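The genotype-assignment step of this workaround could be sketched roughly as follows. This is only an illustration, not the actual pipeline: the function name `assign_genotypes`, the input layout (per-barcode counts of reads supporting each genotype's alleles) and the `min_ratio` threshold are all assumptions made for the example.

```python
from typing import Dict, Optional


def assign_genotypes(
    allele_counts: Dict[str, Dict[str, int]],
    min_ratio: float = 2.0,  # hypothetical threshold, chosen for illustration
) -> Dict[str, Optional[str]]:
    """Assign each cell barcode to the genotype with the most supporting reads.

    A barcode is left unassigned (None) when it has no informative reads or
    when the best genotype does not beat the runner-up by `min_ratio`.
    """
    assignments: Dict[str, Optional[str]] = {}
    for barcode, counts in allele_counts.items():
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked or ranked[0][1] == 0:
            assignments[barcode] = None
            continue
        best, best_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        # max(second_n, 1) avoids accepting a call on a single stray read.
        if best_n >= min_ratio * max(second_n, 1):
            assignments[barcode] = best
        else:
            assignments[barcode] = None
    return assignments


example_counts = {
    "AAAC": {"genotype_1": 40, "genotype_2": 3},
    "CCGT": {"genotype_1": 5, "genotype_2": 6},
    "GGTA": {},
}
example_assignments = assign_genotypes(example_counts)
```

The barcodes assigned to each genotype would then be the ones remapped separately with STAR-diploid + STARsolo; unassigned barcodes (ambiguous cells or possible doublets) would be set aside.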

Do you think this adjustment is feasible?

Best wishes,
Matt Parker

alexdobin commented 10 months ago

Hi Matt,

For a mixture of genotypes, STARsolo would need not only to map to a polyploid genome but also to de-multiplex before counting, which is a hard feature to implement. Your approach seems good to me. I do not think there are any issues with mapping and counting separately for each genotype.

mparker2 commented 9 months ago

Hi @alexdobin,

Thank you for your reply! I figured it might be tricky, & I agree my current solution is fine, just not as elegant!

By demultiplexing, do you mean the cell genotyping in this case? I was not envisioning that STARsolo itself would have to do the genotyping. Rather, it could report the haplotypes that give the best alignment (either a single haplotype or a combination of equally plausible haplotypes) in the BAM file using the same ha tag, and the alignments transformed to reference coordinates would then be used for counting (is this not how counting currently works for STAR-diploid?). A downstream tool could then be used to infer the best genotype for each cell.
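To make the idea of a downstream tool concrete, here is a minimal sketch that tallies haplotype tags per cell barcode from SAM-formatted text. It assumes each alignment carries a CB tag (the cell barcode) and an ha tag; how a polyploid ha value would be encoded (for example, how combinations of haplotypes would be written) is not defined anywhere, so the value is treated as an opaque string. The function name `tally_haplotypes` is made up for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def tally_haplotypes(sam_lines: Iterable[str]) -> Dict[str, Counter]:
    """Count how often each `ha` value is seen for every cell barcode."""
    tallies: Dict[str, Counter] = defaultdict(Counter)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        # Optional SAM fields start at column 12 and look like TAG:TYPE:VALUE.
        tags = {}
        for field in fields[11:]:
            name, _type, value = field.split(":", 2)
            tags[name] = value
        barcode = tags.get("CB")
        haplotype = tags.get("ha")
        # Assumption: "-" marks a read without a valid barcode.
        if barcode in (None, "-") or haplotype is None:
            continue
        tallies[barcode][haplotype] += 1
    return dict(tallies)


def sam_record(name: str, *tags: str) -> str:
    """Build a toy SAM line with the 11 mandatory fields plus given tags."""
    mandatory = [name, "0", "chr1", "100", "255", "4M", "*", "0", "0", "ACGT", "IIII"]
    return "\t".join(mandatory + list(tags))


toy_sam = [
    "@HD\tVN:1.4",
    sam_record("r1", "CB:Z:AAAC", "ha:i:1"),
    sam_record("r2", "CB:Z:AAAC", "ha:i:1"),
    sam_record("r3", "CB:Z:AAAC", "ha:i:2"),
    sam_record("r4", "CB:Z:CCGT", "ha:i:2"),
    sam_record("r5", "CB:Z:-", "ha:i:1"),
    sam_record("r6", "CB:Z:CCGT"),
]
toy_tallies = tally_haplotypes(toy_sam)
```

The per-barcode counts could then feed whatever genotype-inference method is preferred; reads without an ha tag are simply ignored here as uninformative.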

Thanks again,
Matt

alexdobin commented 9 months ago

Hi Matt,

If the variants from your multiple haplotypes do not overlap, you could try using the STAR-WASP options with a combined VCF file. This does not construct the polyploid genome, but maps reads locally to each allele, and can output which allele each read mapped to.
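As a rough illustration of what such a run might look like, the sketch below only assembles a STAR command line with the WASP-related parameters (`--varVCFfile`, `--waspOutputMode SAMtag`, and the `vA`, `vG`, `vW` SAM attributes). The helper name `build_star_wasp_command` and all paths are placeholders invented for the example, and the other parameters are one plausible choice rather than a recommended configuration.

```python
from typing import List, Sequence


def build_star_wasp_command(
    genome_dir: str,
    read_files: Sequence[str],
    vcf_path: str,
    out_prefix: str,
) -> List[str]:
    """Return an argument list for a STAR run with WASP tagging enabled."""
    return [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *read_files,
        # Variants used for allele-aware tagging (the combined VCF).
        "--varVCFfile", vcf_path,
        # Report the WASP filtering result as a SAM tag (vW).
        "--waspOutputMode", "SAMtag",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # vA/vG describe the variant alleles and positions overlapped by the read.
        "--outSAMattributes", "NH", "HI", "AS", "nM", "vA", "vG", "vW",
        "--outFileNamePrefix", out_prefix,
    ]


# Placeholder paths, for illustration only.
wasp_command = build_star_wasp_command(
    genome_dir="genome_index/",
    read_files=["reads_R1.fastq", "reads_R2.fastq"],
    vcf_path="combined.vcf",
    out_prefix="wasp_run_",
)
```

The list can be passed to `subprocess.run(wasp_command, check=True)` once the paths point at real files.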

Yenaled commented 5 months ago

@alexdobin I have a closely related question about RNA-seq from heterozygous 129/cast mice (which is a bit trickier than demultiplexing). And of course, since they're the same species (but different strains), most variants will be "overlapping".

I have two options:

Create two diploid genomes, mm10 vs. 129 and mm10 vs. cast, and do two separate alignments. Then go through the two SAM files and 1) filter for the reads that align better to the 129 genome, 2) filter for the reads that align better to the cast genome, 3) combine the two SAM files, 4) get rid of the reads that have both a 129 alignment AND a cast alignment.

OR

Use STAR-WASP with a combined 129+cast VCF.

Any idea on which workflow would be preferable?
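For reference, the filtering logic of the first option can be sketched with plain set operations. This assumes the haplotype of each read has already been extracted from the two diploid alignments into dictionaries keyed by read name, and that haplotype 2 denotes the strain (129 or cast) while haplotype 1 denotes the reference; that numbering is an assumption for the example and depends on how the diploid genomes were built.

```python
from typing import Dict, Set


def strain_specific_reads(
    ha_vs_129: Dict[str, int],
    ha_vs_cast: Dict[str, int],
    strain_haplotype: int = 2,  # assumption: 2 = strain, 1 = reference
) -> Dict[str, Set[str]]:
    """Keep reads assigned to exactly one of the two strain haplotypes."""
    # Steps 1 and 2: reads that align better to each strain genome.
    reads_129 = {name for name, ha in ha_vs_129.items() if ha == strain_haplotype}
    reads_cast = {name for name, ha in ha_vs_cast.items() if ha == strain_haplotype}
    # Step 4: reads with both a 129 and a cast assignment are discarded.
    ambiguous = reads_129 & reads_cast
    # Step 3: the two filtered sets together form the combined output.
    return {"129": reads_129 - ambiguous, "cast": reads_cast - ambiguous}


toy_129 = {"r1": 2, "r2": 1, "r3": 2}
toy_cast = {"r1": 1, "r2": 2, "r3": 2}
toy_split = strain_specific_reads(toy_129, toy_cast)
```

Here `r3` is dropped because it is assigned to the strain haplotype in both alignments, which is exactly the situation that becomes common when most variants are shared between the strains.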

Yenaled commented 5 months ago

Also, what is "a combined VCF file"?
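One possible reading, offered only as an assumption since the thread does not define the term: a combined VCF is a single file holding the union of the variant records from each strain's VCF. The sketch below merges two VCF texts on that assumption and reports positions present in both, since overlapping variants are the case flagged above as problematic. The function name `combine_vcfs` is invented for the example, and reconciling sample and genotype columns between the inputs is not handled.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]


def combine_vcfs(vcf_a: str, vcf_b: str) -> Tuple[str, List[Site]]:
    """Merge two VCF texts; return the merged text and the overlapping sites."""
    header: List[str] = []
    records: Dict[Site, str] = {}
    overlaps: List[Site] = []
    for index, text in enumerate((vcf_a, vcf_b)):
        for line in text.splitlines():
            if not line:
                continue
            if line.startswith("#"):
                if index == 0:
                    header.append(line)  # keep only the first file's header
                continue
            chrom, pos = line.split("\t")[:2]
            site = (chrom, int(pos))
            if site in records:
                overlaps.append(site)  # keep the first record seen, flag the clash
            else:
                records[site] = line
    body = [records[site] for site in sorted(records)]
    return "\n".join(header + body) + "\n", overlaps


toy_vcf_129 = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\n"
    "chr1\t300\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\n"
)
toy_vcf_cast = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "chr1\t200\t.\tG\tA\t.\tPASS\t.\tGT\t0|1\n"
    "chr1\t300\t.\tC\tA\t.\tPASS\t.\tGT\t0|1\n"
)
toy_combined, toy_overlaps = combine_vcfs(toy_vcf_129, toy_vcf_cast)
```

A non-empty overlap list would indicate that the two strains share variant positions and that a simple union may not be appropriate.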

Yenaled commented 5 months ago

OK, I think I have it figured out and it seems the most practical approach is a STAR+WASP based approach. See my response here: https://github.com/alexdobin/STAR/issues/2121

feature request: STAR polyploid? #2005