cole-trapnell-lab / cufflinks

Boost Software License 1.0

Account for truncated empirical distribution in effective transcript length calculation #32

Closed BenLangmead closed 9 years ago

BenLangmead commented 9 years ago

I noticed a lot of variability in the FPKM estimates for some relatively short isoforms when I perturbed the input BAM slightly. (Note: the way I perturbed the BAM somewhat affects the empirical read/fragment length distributions.) I traced the variability to the transcript's effective_length and eventually to here.

For the affected gene(s), effective_length was calculated as less than 1, which was puzzling, and it varied a lot (in relative terms, not absolute) when the input was perturbed. I think this pull request is the fix: the effective length calculation uses the pdf member of the EmpDist class, but it should use npdf because it is considering a truncated version of the empirical distribution.
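
To illustrate the effect described above, here is a minimal Python sketch. It is not the Cufflinks code. It assumes the effective length is the expected number of start positions, sum over l <= L of p(l) * (L - l + 1), and that the "normalized" pdf means dividing p(l) by the probability mass at or below the transcript length L. The function `effective_length` and the toy distributions are hypothetical and exist only for this example.

```python
def effective_length(pdf, transcript_len, normalize=True):
    """Toy effective-length calculation (illustrative, not the Cufflinks implementation).

    pdf: dict mapping fragment length -> probability over ALL fragment lengths.
    transcript_len: transcript length L; only fragments with length <= L can occur.
    normalize: if True, renormalize the pdf over the truncated range [1, L]
               (assumed here to be what a normalized pdf would do).
    """
    # Keep only fragment lengths that fit inside the transcript.
    usable = {l: p for l, p in pdf.items() if 1 <= l <= transcript_len}
    mass = sum(usable.values())
    if mass == 0.0:
        return 0.0  # no fragment from this distribution fits

    # A fragment of length l has (L - l + 1) possible start positions.
    total = sum(p * (transcript_len - l + 1) for l, p in usable.items())
    return total / mass if normalize else total


# Toy fragment-length distributions: most mass at 200, a small tail at short lengths.
# The second one is a slightly perturbed version of the first.
pdf_a = {90: 0.001, 100: 0.001, 200: 0.998}
pdf_b = {90: 0.002, 100: 0.002, 200: 0.996}

for name, pdf in (("a", pdf_a), ("b", pdf_b)):
    raw = effective_length(pdf, 100, normalize=False)
    norm = effective_length(pdf, 100, normalize=True)
    print(name, "unnormalized:", raw, "normalized:", norm)
```

In this toy case, the unnormalized value for a 100 bp transcript is far below 1 and roughly doubles between the two distributions, because it scales with the small tail mass below L. The normalized value is a weighted average of terms that are each at least 1, so it cannot fall below 1 and is unchanged by the perturbation here.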