frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

filter_tx_by_chain.py - use .duplicated(subset="transcript_id", keep=False) to remove single_exon transcripts #11

Closed SamBryce-Smith closed 3 years ago

SamBryce-Smith commented 3 years ago

The filter currently uses groupby.filter(), which may be less efficient because it calls a Python function on each transcript group in a loop.

Extracting the first / last exons for each transcript takes ~97.5 s on the reference files, so switching to duplicated may help cut down the elapsed time.

SamBryce-Smith commented 3 years ago

The groupby & filter step takes ~1 min on reference data (out of ~1.5 min total). Using duplicated inside pr.subset reduces this to ~10 s, so even with filter_single=True the new function (combined with the improvements outlined in #12) takes the same time as the previous version did without filtering!

This should give a big time saving, given the number of calls!
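For anyone comparing the two approaches, here is a minimal sketch in plain pandas, not the code from filter_tx_by_chain.py. The toy exon table, the Start / End column names and the two helper functions are assumptions made for illustration; only the duplicated(subset="transcript_id", keep=False) call comes from this issue.

```python
import pandas as pd

# Toy exon table (hypothetical data): tx2 is the only single-exon transcript.
exons = pd.DataFrame({
    "transcript_id": ["tx1", "tx1", "tx2", "tx3", "tx3", "tx3"],
    "Start": [100, 300, 500, 700, 900, 1100],
    "End": [200, 400, 600, 800, 1000, 1200],
})


def filter_with_groupby(df: pd.DataFrame) -> pd.DataFrame:
    """Old-style approach: run a Python callable once per transcript group."""
    return df.groupby("transcript_id").filter(lambda g: len(g) > 1)


def filter_with_duplicated(df: pd.DataFrame) -> pd.DataFrame:
    """New-style approach: a single vectorised boolean mask.

    keep=False marks every row whose transcript_id occurs more than once,
    so rows from single-exon transcripts are the only ones left unmarked.
    """
    return df[df.duplicated(subset="transcript_id", keep=False)]


multi_groupby = filter_with_groupby(exons)
multi_duplicated = filter_with_duplicated(exons)

print(multi_duplicated)
```

Both helpers should keep the same rows; the difference is that the mask version avoids the per-group Python call. This assumes one row per exon, so that a transcript_id appearing only once means a single-exon transcript. In the pipeline the same mask would presumably be applied through the PyRanges subset call mentioned above rather than by indexing a DataFrame directly.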

Linked to commits in #12 - 89ffc30a6dbe598553e4ca622c436b757a498ca1 is the main one