SamBryce-Smith closed 3 years ago
The groupby & filter step takes ~1 min on reference data (out of ~1.5 mins total). Using duplicated inside pr.subset reduces this to ~10s, so even with filter_single=True the new function (combined with the improvements outlined in #12) takes the same time as the previous one did without filtering!
This should give a big time saving given the number of calls!
Linked to commits in #12 - 89ffc30a6dbe598553e4ca622c436b757a498ca1 is the main one
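For illustration, here is a minimal sketch of the two filtering approaches compared above, written against plain pandas rather than the actual code in the linked commits. The toy table, the column name transcript_id and both helper functions are assumptions made for the example; it only shows why a single duplicated mask can replace a per-group filter when the goal is to drop single-exon transcripts.

```python
import pandas as pd

# Toy exon table. Column names and values are hypothetical, not from the repo.
exons = pd.DataFrame({
    "transcript_id": ["tx1", "tx1", "tx2", "tx3", "tx3", "tx3"],
    "Start": [100, 300, 500, 700, 900, 1100],
    "End": [200, 400, 600, 800, 1000, 1200],
})


def filter_groupby(df, id_col="transcript_id"):
    """Keep multi-exon transcripts by calling a Python function once per group."""
    return df.groupby(id_col).filter(lambda g: len(g) > 1)


def filter_duplicated(df, id_col="transcript_id"):
    """Keep multi-exon transcripts with one vectorised boolean mask."""
    # keep=False flags every row whose ID appears more than once,
    # so single-exon transcripts (tx2 here) are the only rows dropped.
    return df[df[id_col].duplicated(keep=False)]


slow = filter_groupby(exons)
fast = filter_duplicated(exons)
print(fast)
```

Both functions keep the same rows (tx1 and tx3); the second avoids the per-transcript Python call. With pyranges, the same mask could presumably be passed as the function given to subset, e.g. gr.subset(lambda df: df["transcript_id"].duplicated(keep=False)), though that call is untested here.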
Filter currently uses groupby.filter(), which may be less efficient because it loops over the groups.
Extracting first / last exons for each transcript is taking ~97.5s (on reference files), so a faster filter may help cut down the time elapsed.