frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0
1 stars 0 forks source link

filter_tx_by_intron_chain.py - get_internal_exons could avoid groupby using sort + ~ duplicated(keep="last") #10

Closed SamBryce-Smith closed 3 years ago

SamBryce-Smith commented 3 years ago

Group by turns the filtering into a looping operation which will be slow with many Txs (ref internal exons takes ~ 37s single-threaded )- I think this could be quicker.

Possible option is to make use of duplicated just on the transcript_id column, keeping the 'last' region_number row. THis boolean mask/Series can be negated to avoid selecting the last region for each group. I suspect groupby.idxmax() may already be quite efficient though so maybe minor speedup (especially if enforce sort for safety).

i.e. something like this

# Make sure gr is sorted by transcript_id & 'region number' (ascending order so 1..n)
gr = gr.apply(lambda df: df.sort_values(by=["transcript_id", region_number_col], ascending=True), nb_cpu)

# Subset for internal regions for each id (not first or not last)
gr = gr.subset(lambda df: (df[region_number_col] != 1) | ~(df.duplicated(subset="transcript_id", keep="last")),
                       nb_cpu)
SamBryce-Smith commented 3 years ago

final solution was this. Confusingly keep="last" sets the last as False and others True, so don't need to invert the mask to remove last exons.

    # Filter out last exons for each transcript (max exon_number)
    # & first exons for each transcript (exon_number == 1)
    out_gr = (exons_gr.subset(lambda df: (df[region_number_col].ne(1).astype(bool)) & 
                     (df.duplicated(subset=["transcript_id"], keep="last")), # sets last to 'False'
                     nb_cpu=nb_cpu
                    )
             )

Shaves off ~ 10s single threaded, multithreaded ~ 10s slower so am deprecating multiple threads for this function.

Also renamed to more general get_internal_regions.

closed with commit 2cc9c1109816e720ecb3a94de11e2ec846d94d87