LieberInstitute / SPEAQeasy

SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example of how to use this pipeline and analyze the resulting output files.
http://lieberinstitute.github.io/SPEAQeasy
MIT License

Gene end coordinates don't always agree with GTF #88

Closed Nick-Eagles closed 1 year ago

Nick-Eagles commented 2 years ago

We currently grab gene end coordinates from featureCounts output, so some rows of rse_gene can disagree with the reference GTF (only in their end coordinates). Instead, all coordinates should be pulled from the GTF.
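To make the proposal concrete, here is a minimal sketch of reading gene coordinates straight from a GTF and flagging genes whose end coordinate differs from a value reported elsewhere. It is not SPEAQeasy code (the pipeline does this work in R); the function names `gene_coords_from_gtf` and `find_end_mismatches` are hypothetical, and it assumes the GTF contains `gene` feature rows with a `gene_id "..."` attribute, which is not true of every GTF.

```python
import re

# Matches the gene_id attribute in column 9 of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_coords_from_gtf(lines):
    """Return {gene_id: (seqname, start, end, strand)} from GTF 'gene' rows.

    GTF coordinates are 1-based and inclusive; they are returned unchanged.
    """
    coords = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only 'gene' features carry whole-gene coordinates
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        coords[match.group(1)] = (
            fields[0], int(fields[3]), int(fields[4]), fields[6]
        )
    return coords


def find_end_mismatches(gtf_coords, reported_ends):
    """Return {gene_id: (gtf_end, reported_end)} where the two ends differ.

    `reported_ends` is a hypothetical {gene_id: end} mapping, e.g. end
    coordinates taken from another tool's output.
    """
    return {
        gene_id: (gtf_coords[gene_id][2], end)
        for gene_id, end in reported_ends.items()
        if gene_id in gtf_coords and gtf_coords[gene_id][2] != end
    }
```

Passing an open file handle as `lines` works, since the function only iterates over lines. Genes missing from the GTF are silently ignored by `find_end_mismatches`; a stricter check might want to report them as well.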

gpertea commented 2 years ago

Gene length is also grabbed from featureCounts' output, but it doesn't seem to have been affected by the featureCounts upgrade from v1.5 to v2.0. That's good news, because for most downstream analyses length was more relevant than the end coordinate.

I guess when the annotation files are built, an .rda file with the GRanges for genes and exons could be prepared and saved so that create_count_objects.R can use it later. Length could be added to mcols(), since for genes it's not simply the width of the GRanges interval but the sum of non-overlapping exon regions across all the transcripts in that gene. A recipe to get that can be found here: https://www.biostars.org/p/83901/
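The length definition above (sum of non-overlapping exon regions per gene) can be illustrated with a small interval-merging sketch. This is not the linked recipe and not pipeline code; `union_exon_lengths` is a hypothetical name, and the sketch assumes each gene's exons all sit on one chromosome and that exon rows carry a `gene_id "..."` attribute.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in column 9 of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def union_exon_lengths(lines):
    """Return {gene_id: length}, where length counts each exonic base once.

    Exons from all transcripts of a gene are pooled, overlapping intervals
    are merged, and the merged widths are summed (1-based inclusive coords).
    """
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the current merged block: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths
```

For a gene with exons 100-200, 150-300 and 400-450, this gives 201 + 51 = 252 bases, whereas the width of the gene interval 100-450 would be 351. That difference is why length needs its own column rather than being derived from start and end.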

Nick-Eagles commented 2 years ago

We also read exon coordinates from featureCounts output, where the GTF should be used instead.