frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

filter_by_tx_chain.py - use sort_introns_by_strand via pyranges.apply for multiprocessing speed up #1

SamBryce-Smith closed 2 years ago

SamBryce-Smith commented 3 years ago

Currently the introns are converted to a df, grouped by transcript, and then sorted within each group by intron position / number (1st in group = first intron).

A simple speed-up would be to keep the function the same but run it via pyranges apply, so it works on multiple dfs (different chr/strand) at once. pr.apply can return a dict (as_pyranges=False), which can be concatenated with pd.concat if a single df is wanted at the end (pr.concat takes PyRanges objects rather than dfs).

Alternatively, the groupby may be the bottleneck... To get the same effect, a combination of pyranges.apply (which splits the dfs by chromosome and strand) and a two-factor sort_values(["transcript_id", "Start" or "End"]), with the second key chosen by strand, may be quicker.

  1. See how much speed-up with using pr.apply & concat with existing code
  2. Test out pr.apply + sort_values
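
Rough sketch of option 2, to make the idea concrete. This is not the code in filter_by_tx_chain.py: the function name sort_introns_two_factor and the toy data are made up for illustration, and it assumes a minus-strand transcript's first intron is the one with the largest End. The function takes a single chromosome/strand df, which is the shape pr.apply would hand over; here the split is mimicked with a plain pandas groupby so the snippet runs without pyranges.

```python
import pandas as pd


def sort_introns_two_factor(df: pd.DataFrame) -> pd.DataFrame:
    """Sort one chromosome/strand df so each transcript's introns run 5' to 3'.

    Hypothetical helper: assumes every row in df shares the same Strand.
    """
    if (df["Strand"] == "-").all():
        # Minus strand: the 5'-most intron has the largest coordinate.
        return df.sort_values(["transcript_id", "End"], ascending=[True, False])
    # Plus strand: the 5'-most intron has the smallest coordinate.
    return df.sort_values(["transcript_id", "Start"], ascending=[True, True])


# Toy introns (made-up coordinates): tx1 on the plus strand, tx2 on the minus strand.
introns = pd.DataFrame(
    {
        "Chromosome": ["chr1"] * 5,
        "Start": [500, 100, 300, 900, 700],
        "End": [600, 200, 400, 1000, 800],
        "Strand": ["+", "+", "+", "-", "-"],
        "transcript_id": ["tx1", "tx1", "tx1", "tx2", "tx2"],
    }
)

# Stand-in for the per-chromosome/strand split that pyranges performs.
parts = {
    key: sort_introns_two_factor(part)
    for key, part in introns.groupby(["Chromosome", "Strand"])
}

# Join the per-group results back into a single df.
sorted_introns = pd.concat(parts.values(), ignore_index=True)
print(sorted_introns)
```

No per-transcript groupby is needed for the sort itself; the only grouping is the chromosome/strand split, which pyranges already maintains internally.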

SamBryce-Smith commented 2 years ago

Should be closed by commit 4449302703bdafbc8241b21e442b884c4a2d1724