frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

filter_tx_by_pas.py - add script to filter transcripts for last exons with nearby PolyASite sites or PAS motifs #21

Closed SamBryce-Smith closed 2 years ago

SamBryce-Smith commented 2 years ago

Add a script which takes intron-chain filtered transcripts and applies additional filters to check that their predicted 3' ends show evidence of being a genuine cleavage event:

  1. Does the predicted 3' end fall within x nt of a polyA site from PolyASite / PAS atlas?
    • x is a maximum distance cut-off supplied at the command line
    • So far I've only considered the nearest upstream site. Should also consider the nearest site in either direction (if anything, I'd expect a nearby site to be more likely downstream, as coverage close to the 3' end in short-read data drops sharply due to fragment-size selection).
  2. Take the 3'-most x nucleotides from last exons - do they have at least 1 of the conserved PAS motifs?
    • x is a target region length (e.g. the last 100 nt of each last exon). Note that pr.three_end() returns the final nucleotide of each range, so the actual 3' end 'upstream extension' should be target length - 1.
    • Built-in motif sets should be the 12 from Beaudoing 2000 and the 18 from Gruber 2016 (PolyASite v1.0). There should also be an option to provide a text file of motifs at the command line.
    • The distance/position should ideally be reported from the 3' end - this would mean reversing the order of the nucleotide sequence of regions on the plus strand (so the left-most nucleotide is the most 3'/final nucleotide).
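A rough sketch of the two checks above, in plain Python rather than pyranges, just to pin down the intended logic. Everything here is an assumption for illustration and not the script itself: the function names are hypothetical, coordinates are treated as single inclusive positions, site positions are assumed to be pre-filtered to the same chromosome and strand and sorted, and region sequences are assumed to already be in 5'-to-3' transcript orientation (reverse-complemented for minus-strand regions). Only the two canonical hexamers are included; the full Beaudoing 2000 and Gruber 2016 lists are not reproduced.

```python
from bisect import bisect_left

# Illustrative subset only: the two canonical hexamers. The full Beaudoing 2000
# and Gruber 2016 motif lists are not reproduced here.
EXAMPLE_MOTIFS = ("AATAAA", "ATTAAA")


def signed_distance_to_nearest_site(three_end, strand, sites):
    """Distance to the nearest site in either direction.

    `sites` is a sorted list of positions on the same chromosome and strand
    (assumption). Positive = site is downstream of the 3' end in transcript
    orientation, negative = upstream. Returns None if there are no sites.
    """
    if not sites:
        return None
    i = bisect_left(sites, three_end)
    # Only the neighbours either side of the insertion point can be nearest
    candidates = sites[max(i - 1, 0): i + 1]
    diff = min((s - three_end for s in candidates), key=abs)
    # On the minus strand, smaller genomic coordinates are downstream
    return diff if strand == "+" else -diff


def passes_atlas_filter(three_end, strand, sites, max_distance):
    """True if any site lies within max_distance nt of the 3' end."""
    dist = signed_distance_to_nearest_site(three_end, strand, sites)
    return dist is not None and abs(dist) <= max_distance


def target_region(three_end, strand, target_length):
    """Inclusive (start, end) of the 3'-most target_length nt.

    The 3' end position is itself the final nucleotide, so the upstream
    extension is target_length - 1.
    """
    extension = target_length - 1
    if strand == "+":
        return (three_end - extension, three_end)
    return (three_end, three_end + extension)


def find_pas_motifs(region_seq, motifs=EXAMPLE_MOTIFS):
    """Return (motif, nt between motif end and 3' end) for every hit.

    `region_seq` must be in 5'->3' transcript orientation (assumption), so
    its last character is the final nucleotide of the last exon.
    """
    seq = region_seq.upper().replace("U", "T")
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((motif, len(seq) - (start + len(motif))))
            start = seq.find(motif, start + 1)
    # Hits closest to the 3' end come first
    return sorted(hits, key=lambda hit: hit[1])
```

Measuring the motif position as an offset from the end of the string gives the same 'distance from the 3' end' as reversing the sequence would, without having to reverse the motifs as well. A transcript would then pass the second filter if `find_pas_motifs` returns at least one hit.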

The script should take as input:

The script should output: