filter_tx_by_intron_chain.py - get_terminal_regions uses sort & .duplicated(subset="transcript_id", keep=<first/last>)

frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data

GNU General Public License v3.0


filter_tx_by_intron_chain.py - get_terminal_regions uses sort & .duplicated(subset="transcript_id", keep=<first/last>) #12

Closed SamBryce-Smith closed 3 years ago

SamBryce-Smith commented 3 years ago

Similar to #10, but here the aim is to use a boolean mask to keep only the first or last region of each transcript.

Would need a sort([tx_id, region_number_col]) first to be safe, then wrap duplicated in pr.subset as before...

Again, for reference regions this currently takes > 1.5 mins, which is too long...
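
A minimal sketch of the idea on a plain pandas DataFrame, not the code that was committed. The function name, the `region_number` column and the toy data are hypothetical, and it assumes the region number increases from the first to the last region of each transcript. In the pipeline itself the mask would be wrapped in pr.subset rather than applied to a DataFrame directly.

```python
import pandas as pd


def get_terminal_regions_sketch(df, tx_id_col="transcript_id",
                                region_number_col="region_number",
                                which="last"):
    """Keep only the first or last region of each transcript (illustrative only)."""
    if which not in ("first", "last"):
        raise ValueError("which must be 'first' or 'last'")

    # Sort so that 'first'/'last' refers to region order within each transcript
    df_sorted = df.sort_values([tx_id_col, region_number_col])

    # duplicated(keep=which) is False only for the kept occurrence per transcript,
    # so the negated mask selects exactly one row per transcript
    keep = ~df_sorted.duplicated(subset=tx_id_col, keep=which)
    return df_sorted[keep]


# Hypothetical, deliberately unsorted toy data
regions = pd.DataFrame({
    "transcript_id": ["tx2", "tx1", "tx1", "tx2", "tx1", "tx3"],
    "region_number": [2, 3, 1, 1, 2, 1],
    "Start": [500, 300, 100, 400, 200, 900],
})

first_regions = get_terminal_regions_sketch(regions, which="first")
last_regions = get_terminal_regions_sketch(regions, which="last")
```

Single-region transcripts (tx3 here) are kept in both cases, since their only row is both the first and the last occurrence.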

SamBryce-Smith commented 3 years ago

Shaves off ~5-10s on reference data, even when including a sort for safety (when filter_single=False). A relatively small gain, but as this function is called at least 5 times, this should save a considerable chunk of time!

Closed with 89ffc30a6dbe598553e4ca622c436b757a498ca1 (plus 9e776ab4876a199303b324d00c9462fbe68c8380 and b8efda9b10d6d2c5e28779ed0132784da4e23970 for removing the nb_cpu argument, as performance decreased).