Open travc opened 9 years ago
Hi Travis
Thanks for using EMASE and contacting. I think you right in guessing that the individualized pooled transcriptome is missing genes/isoforms/alleles that is present in the reference GTF file. This typically happens when insertions/deletions removes them. (Just curious to know how you created the pooled transcriptome adjusting for SNPs and Indels.)
A hack to get around the issue is to try manually adding the missing genes/isoforms/alleles and its "fake" lengths and try running EMASE. Since the pooled transcriptome does not have the sequences for these missing genes/isoforms/alleles, the quantitation would not really get affected.
Thank you Narayanan
We are really just following the first "real use case" example at this point. I'm still a bit fuzzy on the details myself since I just stepped in to help a colleague who got stuck.
I assumed that adjust_annotations.py
combined with prepare-emase
adjusts for SNPs and indels when creating the pooled transcriptome.
If transcripts with a length of 0 won't effect the results, then maybe a work-around is just to make the exception a warning instead. I'll play around with that a bit.
So an apparent hackish fix is to just replace the raise
on line 62 of EMfactory.py with a warning and setting those 0's to 1's.
self.target_lengths[self.target_lengths==0] = 1
I only understand what emase
is doing at a very superficial level, but if giving missing transcripts 'fake' lengths won't effect the results, then I think this fix should be kosher. There are probably more proper/better ways of handling it (or avoiding the problem in the first place), but it seems like a rare bug so maybe this quick fix is good enough.
Since the bug is still outstanding in the code, I'll let the maintainers close
I'm triggering
RuntimeError('There exist transcripts missing length information.')
We are following the example: https://emase.readthedocs.org/en/latest/examples.html#estimating-allele-specific-expression-from-human-rna-seq-data A probably important note is that we're working with a significantly rougher reference (A. gambiae) than human, mouse, ect. Samples are all the same species using the same ref and annotation. Anyway, my guess is that some wonkiness in the reference gtf is causing the problem.
Assuming my arithmetic is correct, I've traced it to the fact that there are some transcripts listed in
emase.gene2transcripts.tsv
which are not inemase.pooled.transcripts.info
.It appears that 37 transcripts are being lost someplace between generating
emase.transcripts.info
andemase.pooled.transcripts.info
.31238/2 = 15619 ... If I understand correctly, that should equal the total number of transcripts (15656). I could be wrong of course.