Closed jergosh closed 4 years ago
Hi there,
For accurate quantification of FPKM from RNA-Seq data, the read counts need to be normalised by the effective length of each feature. The effective length is computed by subtracting meanFragmentLength from the feature length. Features shorter than meanFragmentLength are therefore dropped automatically.
See also the Lee et al. (2011) paper for more information on effective length normalisation.
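For illustration, the filtering described above can be sketched in a few lines of R. This is only a sketch under stated assumptions, not the package's actual code: the feature names and numbers are made up, and the "+ 1" in the effective length is a commonly used definition that the package may or may not follow.

```r
# Hypothetical inputs; names and values are illustrative only.
feature_length <- c(geneA = 1500, geneB = 120, geneC = 900)
mean_fragment_length <- 200

# Assumed definition: effective length = feature length - mean fragment length + 1.
effective_length <- feature_length - mean_fragment_length + 1

# Features with a non-positive effective length cannot be normalised,
# so only the remaining ones are kept.
keep <- effective_length > 0
effective_length[keep]
```

Here geneB is shorter than the mean fragment length, so it is the one that falls out of the result.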
I understand why the FPKM estimates for these features necessarily have to be 0, but it's not useful to drop them silently without making it clear which features remain.
As it stands, anyone using the package has to write their own code to remove features whose length is < meanFragmentLength. It would be more user-friendly to either set the FPKM values for these features to 0 or at least return the FPKM values for the remaining features in a named vector.
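To make the suggestion concrete, here is a rough sketch of what "set to 0 and keep the names" could look like for a single sample. It does not call the package at all; the data are hypothetical, and both the effective-length definition and the FPKM formula used here are assumptions that may differ from what the package does.

```r
# Hypothetical single-sample data; names and values are illustrative only.
counts <- c(geneA = 500, geneB = 20, geneC = 300)
feature_length <- c(geneA = 1500, geneB = 120, geneC = 900)
mean_fragment_length <- 200

# Assumed effective-length definition (the package may differ).
effective_length <- feature_length - mean_fragment_length + 1
too_short <- effective_length <= 0

# Start from zeros so every input feature keeps its name and position.
fpkm_values <- setNames(numeric(length(counts)), names(counts))

# Assumed formula: counts * 1e9 / (effective length * total counts),
# with the total taken over the retained features only.
n_total <- sum(counts[!too_short])
fpkm_values[!too_short] <- counts[!too_short] * 1e9 /
  (effective_length[!too_short] * n_total)

fpkm_values
```

The output has the same length and names as the input, so there is no guessing about which value belongs to which feature.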
It would appear that all features that are shorter than the meanInsertSize (in any column) are just dropped silently. Since fpkm() doesn't take feature names, this can make it quite tricky to figure out which FPKM values correspond to what feature (+ I imagine it will not always be obvious why the output matrix has different dimensions from the input one).