Open jeremymsimon opened 6 years ago
Hi @jeremymsimon,
So I'm trying to think about what issues could be reasonably dealt with here.
1) If the length of the sequence in the BAM header and the sequence provided in the FASTA file are different, this seems like a very difficult error to recover from since records can then contain alignments to bases about which we don't know.
2) If the same transcript appears multiple times in the input BAM header, but with different lengths, this also seems a difficult situation to allow. Exact duplicates are one thing, but I'm not sure if sequences ever appear with the same name but different lengths. If so, I'm thinking this would be a hard error.
So, I think at least one situation we could reasonably deal with is that the input FASTA file contains multiple entries with the same name (and same length / sequence). In this case, we could retain only one of them, and document / log the fact that multiple identical entries were present in the input. Of course, there would still be an issue if we had a mismatch as with your example STAR input, where STAR concatenated all 3 occurences of an identical transcript. Are there other cases you can think of where it would make sense to somehow deal with the issue in salmon and continue with processing (perhaps with some extra warnings / log info)?
I can answer some of this, but the rest I might have to do some digging. I currently don't know whether the identical RefSeq ID can appear multiple times but have multiple lengths - I suspect this might happen at some low but non-zero frequency, but I would have to look. Would it be bad to, given a FASTA file and information in the BAM header, check if there are exact duplicate sequences, count how many duplicates there are, sum the lengths, and check if THAT is the number in the BAM header? In other words, in the example above, there are 3 duplicates of 68, 68 * 3 = 204, which is what the BAM header shows. Salmon could perhaps do that math and allow for either 68=68 (like it does now) or 68=204 (after counting and multiplying duplicates)?
Yes, this would be possible. My main concern would be how different upstream tools handle this. We know STAR sums the lengths of the repeats --- does HISAT / BBMap etc.? I would like to avoid having too many special cases that are tied in with certain tools. I'm not completely opposed to having a few special cases, so long as there are only a few ;P.
We'll have to crowdsource the answer to that, as I'm mostly unfamiliar with the output from other aligners
I've found at least ~100 instances of identical RefSeq IDs that appear on more than one chromosome in the human genome reference. These are a mix of (I'm guessing) chrX/Y homologous genes, miRs, snoRNAs, and other long/short noncoding RNAs. This issue will also apply to genes appearing in genomic regions with sequence from multiple haplotypes (e.g. chr6_qbl_hap6, etc) since those also have identical RefSeq IDs. This is certainly bad practice for annotation purposes, but the consequence is that STAR effectively lumps the duplicates into one transcript model (e.g. 3 appearances of 68 bp miR means STAR annotates as one 204 bp gene), and salmon then finds a discrepancy during error modeling between the annotated and "STAR-observed" gene lengths, throwing a fatal error. One fix of course would be to remove duplicated sequences from the gene annotation prior to alignment with STAR, or to have STAR deal with duplicated sequences differently, but it might be nice to have some sort of check in place downstream as well to avoid failure.