frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

filter_by_tx_chain.py - refactor implementation of match_by="transcript" #3

Closed SamBryce-Smith closed 1 year ago

SamBryce-Smith commented 3 years ago

merge_ordered populates NA rows for every reference_id in the right df (novel-reference matches), even if that reference_id never matched any novel transcript in Q (the left df of novel_tr | intron_id | intron_number).

The code is functional on tiny test datasets, but the script crashes when scaling up to a realistic GTF size (100ks to 1 million rows), because this produces a huge df filled with NAs.

The idea of using merge_ordered to get a df similar to the "any" one was clean, but the NA overpopulation makes it unfeasible. May have to implement a separate function to match chains by transcript (i.e. without having 1..n for intron number & 1,1..0 for matches), unless I'm doing something wrong in my configuration of merge_ordered that is causing the NA overpopulation (make a SO question).
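
A minimal sketch of the behaviour described above, not the actual code from filter_by_tx_chain.py. The toy data, the choice of intron_id as the join key and the "full chain" criterion are all assumptions made for illustration. pandas' merge_ordered defaults to how="outer", so reference rows with no partner in the left df are kept and padded with NA; how="left" or a plain inner merge avoids that. Whether this is what caused the NA overpopulation here depends on the real merge_ordered call, which isn't shown in this issue.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration)
novel = pd.DataFrame({
    "novel_tr": ["N1", "N1", "N2"],
    "intron_id": ["i1", "i2", "i9"],
    "intron_number": [1, 2, 1],
})
ref = pd.DataFrame({
    "reference_id": ["R1", "R1", "R2", "R3"],
    "intron_id": ["i1", "i2", "i1", "i7"],
})

# Default how="outer": the R3 row has no novel partner but is kept with NA
outer = pd.merge_ordered(novel, ref, on="intron_id")

# how="left": only introns present in the novel (left) df are kept
left = pd.merge_ordered(novel, ref, on="intron_id", how="left")

# Plain inner merge: only introns shared by a novel and a reference transcript
inner = novel.merge(ref, on="intron_id", how="inner")

# Assumed criterion for matching by transcript: every intron of the novel
# transcript is found in the same reference transcript
n_matched = (
    inner.groupby(["novel_tr", "reference_id"])
    .size()
    .rename("n_matched")
    .reset_index()
)
n_introns = novel.groupby("novel_tr").size().rename("n_introns").reset_index()
pairs = n_matched.merge(n_introns, on="novel_tr")
full_chain = pairs[pairs["n_matched"] == pairs["n_introns"]]
```

With the inner merge, the intermediate df grows with the number of shared introns rather than with the number of reference rows, so unmatched reference transcripts never enter it.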

SamBryce-Smith commented 1 year ago

Closing as I don't plan to use filter_tx_by_intron_chain.py further.