alexdobin / STAR

RNA-seq aligner
MIT License
1.86k stars 506 forks source link

Supporting GRCh38/hg38 #39

Open schelhorn opened 9 years ago

schelhorn commented 9 years ago

I was wondering how STAR development will continue with regard to the new GRCh38/hg38 assembly and the existence of ALT contigs (Reference 1 Reference 2). To quote Heng Li:

GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp of the primary assembly. However, sequences that are highly diverged from the primary assembly only contribute a few million bp. Most subsequences of ALT contigs are nearly identical to the primary assembly. If we align sequence reads to GRCh38+ALT blindly, we will get many additional reads with zero mapping quality and miss variants on them. It is crucial to make mappers aware of ALTs. Source

Heng developed bwakit to approach a solution to ALT mapping for DNA-Seq, but as far as I know no similar implementation for RNA-Seq/spliced mapping has been proposed as of now. Since all relevant genome annotations for GRCh38/hg38 have been published, how is the recommended workflow for STAR concerning the new assembly?

Obviously, one can just remove the ALT contigs and map against hg38-noalt, but that would disregard the interesting advantages of the ALTs. Also, future assemblies will continue to capture known variation (until we will use the string graph directly, I imagine), so I suggest it could be fruitful sense to think about the consequences of this new assembly structure for spliced read mapping.

ndaniel commented 9 years ago

Hi!

Heng Li quotation refers to exome/whole-genome/targeting sequencing and it does not refer to RNA sequencing. Therefore it is not clear at all if those findings of Heng Li from DNA-seq translate into RNA-seq!

Cheers, Daniel

mbourgey commented 9 years ago

Hi, If alternate sequences contains genes sequences higly similar to the original copy, I guess the problem would occurs also in RNA-seq alignment.

my 2c

Best,

Mathieu

kippakers commented 9 years ago

In my experience with MHC sequences, the global read aligner parameters just don't apply. For example, you've got 3 genes; HLA-DQB1, HLA-DRB1, and HLA-DPB1. These genes have very similar sequences. Then each of these has ~10 haplotypes that are common, and hundreds of possible haplotypes. These haplotypes have extreme polymorphism, on the order of 1 SNP every 5 bp in the regions that actually distinguish the haplotypes. Then each gene has several pseudogenes which can be expressed.

Just figuring out what the genomic sequence of the 6 haplotypes that make up an individual's genome at these genes using short reads is a tough mathematical/computational problem. For most users, it's probably not worth it for the extra work STAR would have to do to solve this problem. But I agree with schelhorn that it's an area where some clever innovation could represent a big advance in read alignment. On the other hand, once sequencing reads are long enough, the problem will be fairly well solved.

ndaniel commented 9 years ago

Also, the similarity of two sequences (of two genes) is a (continuous) function of read length from the input FASTQ files. For example, for reads of length 15 bp most of the genes will look highly similar and for reads of 1000 bp most of them will be highly dissimilar.

I agree with kippakers that for most users it is probably not worth the extra work for STAR to solve this problem especially that it is highly experimental and the benefits are not very clear/sure.

schelhorn commented 9 years ago

Perhaps some clarification is in order; my reason for opening this issue was to survey pros and cons of incorporating ALT awareness into STAR.

Regarding risks, reads from transcripts located at ALT contigs will confuse STAR (or, for that matter, any read mapper) and result in (unfairly) reduced mapping quality since multiple good mapping locations exist on the reference. Of course, very long read lengths may help, but waiting for them to be cost-effective for RNA-Seq probably is not the wisest course of action. In addition to the mapping quality issue, most downstream quantitation tools also punish non-unique alignments and will often not consider such alignments for estimating expression. Therefore, the full hg38 genome including ALT genomes should probably not be used with STAR at the moment, in my opinion. Of course, hg38-noalt is fine to use.

Regarding opportunities, ALT contigs represent LRC/KIR/MHC loci that are highly variable across populations and of considerate interest to the immunology and immuno-oncology communities. Quantitating transcription of these entities requires aligning reads to an appropriate variation-aware reference such as hg38 in order to increase mapping rates and identify the correct haplotype. While longer read lengths may be used to rescue mapping rates by anchoring reads at invariable sequences, the identification of the haplotype that is actually expressed does not benefit significantly from longer reads if no appropriate reference sequence is available.

My point is not necessarily to push STAR development in a direction that dedicates restricted resources to support alternate reference contigs; rather I wanted to make Alex and others aware of the situation. For obvious reasons, the genomic variation community is a little farther ahead with this than the transcriptional people since the prior already have an ALT-aware aligner. However, usage of alternate loci will only increase in the future once more data about pointwise and structural variation is incorporated into the reference, so the problem of alternate sequences will not go away.

ndaniel commented 9 years ago

Some quick comments are here:

Regarding risks, reads from transcripts located at ALT contigs will confuse STAR (or, for that matter, any read mapper) and result in (unfairly) reduced mapping quality since multiple good mapping locations exist on the reference. ... In addition to the mapping quality issue, most downstream quantitation tools also punish non-unique alignments and will often not consider such alignments for estimating expression.

This is true but the issue is more complex than this. There are also:

These have very high similarity too and STAR (or any other aligner which is used for RNA-seq data) will get confused when trying to map a read on a gene or on its paralogs (or on its pseudogene or on another gene with overlaps it). These will get punish too by non-unique alignments. These are a challenge for RNA-seq data and aligner.

schelhorn commented 9 years ago

Sure, homology is a general problem in read mapping -- for DNA as well as RNA. Still, homologs that occur naturally within a single genome are a justified reason for mapping uncertainty. Artificially added homologs such as ALT contigs, on the other hand, were added to capture known variation and have a well-known (since we created it), mutually exclusive relationship. These are a different beasts since they can be processed explicitly by the aligner. So even keeping in mind homology as a general challenge does not absolve us (as a community) from working ALT information into the aligners, in principal. Else, to illustrate an extreme case, you will end up with no unique reference regions at all once all known human variation is incorporated.

ndaniel commented 9 years ago

From the examples above, some overlapping genes may also be seen also as artificially added (i.e. one gene is made longer on purpose in order to overlap another one, usually its neighbor) in order to capture known variation (and might have a well-known exclusive relationship, e.g. healthy vs tumor), like for example fusion genes (see: GOPC-ROS1 and FIP1L1-PDGFRA fusions which appear only in some cancers but in healthy samples these genes never overlap; therefore all these four genes GOPC, ROS1, FIPL1, PDGFRA are punished in most of the DEG analysis which do not involve those specific types of tumors were they appear)! One may argue that all fusion genes (which are represented as overlapping genes) can be seen exactly as ALT contigs!

JohnMCMa commented 8 years ago

In that case, I'd like to ask this question--ndaniel, for those who want to identify variants from their RNA-seq data, does that mean they should align it with, say, bwa, if they want to do alt contigs, or do you mean doing alt contig on RNA-seq data is never a good idea (and should be exclusive domain of WGS/WES)?

tdanhorn commented 8 years ago

In my opinion dealing with alternative alignments of any kind (homologs, paralogs, repeats, alt contigs, etc.) is not the job of the mapper (e.g. STAR), but of the counting algorithm. The mapper already does the right thing -- it marks a read a multimapped if it cannot unambiguously assign it to a single part of the reference sequence. How to interpret multimapped reads is a different question.

In the case of alt contigs, if you are just interested in gene-level expression (>90% of RNA-Seq in my experience) and you have a read that maps in several alternative sequences for the same gene (but nowhere else!), you just count it for that gene. Obviously you cannot use any of the standard tools (HTSeq, subread featureCounts, etc.) which will simply discard any multimapped read as "ambiguous".

If you are interested in some kind of typing/variant calling, you just add up all the reads that support each of the alternatives and see which ones come out on top. You don't care if a read is multimapped or not. (To distinguish truncations from full-length, etc., you will have to take the coverage of each alternative sequence into account when you compare support between alternatives.)

roryk commented 8 years ago

You could use hisat2 (https://ccb.jhu.edu/software/hisat2/index.shtml) to do alignments if you want to incorporate the alts with RNA-seq data, it is alt-aware and can do split read alignments and then call RNA-seq variants with GATK or VarDict (https://github.com/AstraZeneca-NGS/VarDict) with some tagging of edit events.

tantrev commented 8 years ago

Does hisat2 support genome patches? I'd like to use the latest "toplevel" genomes from Ensembl for RNA-Seq, but there doesn't seem to be much documentation on such support. I'm a little leery about multi-mappers, given that most of the mouse patches are "fixes."