frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0
1 stars 0 forks source link

filter_tx_by_three_end - updating to atlas 3'end has unexpected(?) behaviour when polyA sites are intervals not single nucleotides #50

Open SamBryce-Smith opened 1 year ago

SamBryce-Smith commented 1 year ago

https://github.com/frattalab/PAPA/blob/aa034556b3bb96448eb58044b7414e34f2a42862/scripts/filter_tx_by_three_end.py#L329 if the nearest atlas site is 0 i.e. directly overlapping, then the 3' coordinate is kept as is.

With the standard PolyASite BED file (and generally pas from 3'seq), polyA sites are typically represented as clusters (i.e. a region) rather than a single coordinate. This means that different predicted 3'coordinates can overlap with the same atlas site and not be updated.

This also means that the updating strategy will use the 3'most coordinate of the nearest cluster, not necessarily the 'representative coordinate' for that cluster (the position with the highest read support within the cluster). This may not be the optimal behaviour

This is probably acceptable within a single experiment, but makes things a little more complicated when combining predicted last exons across experiments. e.g. when generating BEDs of representative PAS for last exons, because multiple closely spaced 3'ends will be predicted for the same atlas site. Would also lead to effective duplication (i.e. exons differing by a few nucleotides), which is probably not a good thing for Salmon's index size (although shouldn't have an effect in practice for differential usage as the isoform expressions will be summed together).

First thoughts: