In the standard PolyASite BED file (and in PAS annotations from 3'-seq generally), polyA sites are typically represented as clusters (i.e. regions) rather than single coordinates. This means that several different predicted 3' coordinates can overlap the same atlas site, and none of them will be updated.
This also means that the updating strategy uses the 3'-most coordinate of the nearest cluster, not necessarily the 'representative coordinate' for that cluster (the position with the highest read support within the cluster). This may not be the optimal behaviour.
This is probably acceptable within a single experiment, but it complicates combining predicted last exons across experiments, e.g. when generating BEDs of representative PAS for last exons, because multiple closely spaced 3' ends will be predicted for the same atlas site. It would also lead to effective duplication (i.e. exons differing by a few nucleotides), which is probably not good for the size of Salmon's index (although it shouldn't affect differential usage in practice, as the isoform expressions will be summed together).
First thoughts:
The behaviour of the updating strategy should, at the very least, be documented.
Provide steps/example code to convert PolyASite clusters to single-nucleotide BEDs of their representative coordinates, if desired (I prefer this to adding code within the script to extract the representative coordinate, as that would be less general).
Note that using clusters rather than single nucleotides may treat sites unequally: e.g. looking +/- 100 nt around a 15 nt cluster covers 215 nt, versus 205 nt around a 5 nt cluster, so wider clusters are effectively searched with larger windows.
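The conversion suggested above could look something like the following minimal sketch. It rests on two assumptions that should be checked against the atlas release in use: that column 4 of the cluster BED holds an ID of the form `chrom:representative_coordinate:strand`, and that the representative coordinate is 1-based. `rep_coord_to_bed` is a hypothetical helper name, not something that exists in PAPA.

```python
def rep_coord_to_bed(lines, one_based=True):
    """Yield single-nucleotide BED6 lines from cluster BED lines.

    ASSUMPTION: column 4 is 'chrom:representative_coordinate:strand'.
    ASSUMPTION: the representative coordinate is 1-based unless
    one_based=False is passed.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and common BED header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, name, score, strand = fields[0], fields[3], fields[4], fields[5]
        # rsplit keeps any ':' inside the chromosome name intact
        _, rep, _ = name.rsplit(":", 2)
        rep = int(rep)
        # BED intervals are 0-based, half-open
        start = rep - 1 if one_based else rep
        yield "\t".join([chrom, str(start), str(start + 1), name, score, strand])
```

Passing an open file handle of the cluster BED and writing each yielded line to a new file gives a BED with one position per atlas site, which can then be supplied in place of the cluster file.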
https://github.com/frattalab/PAPA/blob/aa034556b3bb96448eb58044b7414e34f2a42862/scripts/filter_tx_by_three_end.py#L329: if the distance to the nearest atlas site is 0, i.e. the predicted 3' end directly overlaps it, then the 3' coordinate is kept as is.
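To make the consequence concrete, here is a simplified sketch of the rule as described in this issue. It is not the code in `filter_tx_by_three_end.py`: the function name `update_three_end` is hypothetical, coordinates are assumed to be 1-based and inclusive, and any distance cutoff is left out.

```python
def update_three_end(pred_end, cluster_start, cluster_end, strand):
    """Return the 3' end after 'updating' against the nearest atlas cluster.

    ASSUMPTION: 1-based inclusive coordinates; illustrative only.
    """
    # Distance 0 (predicted end falls inside the cluster): keep as is
    if cluster_start <= pred_end <= cluster_end:
        return pred_end
    # Otherwise snap to the 3'-most coordinate of the cluster
    return cluster_end if strand == "+" else cluster_start


# Two predictions inside the same cluster stay distinct
print(update_three_end(1005, 1000, 1010, "+"))
print(update_three_end(1008, 1000, 1010, "+"))
# A prediction outside the cluster is moved to the cluster's 3' edge
print(update_three_end(990, 1000, 1010, "+"))
```

With a single-nucleotide atlas, an overlap can only occur at the representative position itself, so nearby predictions would collapse onto one coordinate.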