ZhengXia / dapars

DaPars(Dynamic analysis of Alternative PolyAdenylation from RNA-seq)
GNU General Public License v2.0
47 stars 33 forks source link

DaPars_Extract_Anno.py - coordinates in output BED file (and 'Loci' column) are shifted 1 nucleotide from source transcript #16

Open SamBryce-Smith opened 3 years ago

SamBryce-Smith commented 3 years ago

TL;DR:

+ strand: DaPars BED/'Loci' 3'UTR start is 1 nt downstream of the actual annotated start - strand: Dapars BED/'Loci' 3'end of the 3'UTR is 1 nt upstream of the actual annotated end

In lines 38-45 of DaPars_Extract_Anno.py, the 'start' coordinates for UTRs (last exons) are converted to '1-based' by adding 1 to the start coordinate. Given that the script outputs a BED file, and according to the BED format convention start coordinates are included in the range, adding 1 to the start coordinate actually excludes the first nucleotide of the UTR on the plus strand and the last nucleotide of the UTR on the minus strand.

Here is an example of extracted UTRs on the + strand. 'D Extracted 3'UTR' is the output BED file of DaPars_Extract_Anno.py, with 'D2 Extracted 3'UTR' the same file from the DaPars2 repo (the script is identical between the two tools). 'GTF' is the reference GTF file from which the input BED12 file was derived. In both cases, the last exon reported in the DaPars BED file begins 1nt downstream of the source transcript:

image

Again, on the - strand the DaPars BED file terminates 1nt upstream of the source transcript end:

image

I'm not sure how DaPars handles this internally (i.e. the coordinates in the BED file may not be interpreted according to BED conventions) and regardless I don't expect 1nt to make a drastic difference to the algorithm's output. The real issue comes when using the BED file/'Loci' column in output files to extract predicted polyA sites, especially the distal polyA site on the minus strand (as the Start coordinate which corresponds to the UTR end is shifted by 1nt).

Screenshots were generated using output from the DaPars & DaPars2 execution workflows from the APAeval project with test data used as input.

Edit: This issue has also been reported to the DaPars2 repo at 3UTR/DaPars2#8