churchill-lab / emase

Expectation-Maximization algorithm for Allele-Specific Expression
http://churchill-lab.github.io/emase/
GNU General Public License v3.0
21 stars 13 forks source link

transcripts missing length information exception #3

Open travc opened 9 years ago

travc commented 9 years ago

I'm triggering RuntimeError('There exist transcripts missing length information.')

We are following the example: https://emase.readthedocs.org/en/latest/examples.html#estimating-allele-specific-expression-from-human-rna-seq-data A probably important note is that we're working with a significantly rougher reference (A. gambiae) than human, mouse, ect. Samples are all the same species using the same ref and annotation. Anyway, my guess is that some wonkiness in the reference gtf is causing the problem.

Assuming my arithmetic is correct, I've traced it to the fact that there are some transcripts listed in emase.gene2transcripts.tsv which are not in emase.pooled.transcripts.info.

It appears that 37 transcripts are being lost someplace between generating emase.transcripts.info and emase.pooled.transcripts.info.

$ wc emase.transcripts.info
15656  31312 291991
$ wc emase.pooled.transcripts.info
31238  62476 738868

31238/2 = 15619 ... If I understand correctly, that should equal the total number of transcripts (15656). I could be wrong of course.

narayananr commented 9 years ago

Hi Travis

Thanks for using EMASE and contacting. I think you right in guessing that the individualized pooled transcriptome is missing genes/isoforms/alleles that is present in the reference GTF file. This typically happens when insertions/deletions removes them. (Just curious to know how you created the pooled transcriptome adjusting for SNPs and Indels.)

A hack to get around the issue is to try manually adding the missing genes/isoforms/alleles and its "fake" lengths and try running EMASE. Since the pooled transcriptome does not have the sequences for these missing genes/isoforms/alleles, the quantitation would not really get affected.

Thank you Narayanan

travc commented 9 years ago

We are really just following the first "real use case" example at this point. I'm still a bit fuzzy on the details myself since I just stepped in to help a colleague who got stuck. I assumed that adjust_annotations.py combined with prepare-emase adjusts for SNPs and indels when creating the pooled transcriptome.

If transcripts with a length of 0 won't effect the results, then maybe a work-around is just to make the exception a warning instead. I'll play around with that a bit.

travc commented 9 years ago

So an apparent hackish fix is to just replace the raise on line 62 of EMfactory.py with a warning and setting those 0's to 1's.

self.target_lengths[self.target_lengths==0] = 1

I only understand what emase is doing at a very superficial level, but if giving missing transcripts 'fake' lengths won't effect the results, then I think this fix should be kosher. There are probably more proper/better ways of handling it (or avoiding the problem in the first place), but it seems like a rare bug so maybe this quick fix is good enough.

Since the bug is still outstanding in the code, I'll let the maintainers close