frattalab / PAPA

PAPA (Pipeline-Alternative Polyadenylation) - Snakemake pipeline for analysis of APA from short-read RNA-seq data
GNU General Public License v3.0

filter_tx_by_intron_chain.py - add checks for empty dataframes when doing gr.apply / gr.assign to avoid KeyErrors & needing df conversion #16

Closed SamBryce-Smith closed 3 years ago

SamBryce-Smith commented 3 years ago

This would likely apply to:

Getting introns from gr.

Possibly other cases too. There should be (or I should find) a way to remove these keys from a PyRanges object to prevent this, though.

SamBryce-Smith commented 3 years ago

This also happens sometimes (!!) with add_region_number, which internally uses pr.assign. The KeyError is for a chromosome (not a chromosome/strand pair), which is a little weird. Also, the error doesn't occur on every run of the script (sometimes it sails through). I have no idea what is going on.

SamBryce-Smith commented 3 years ago

'Sometimes' is because occasionally a merged GTF will contain transcripts with an 'undefined strand' (i.e. not '+' or '-'). PyRanges will then read this GTF in as an 'unstranded' object. add_region_number and other functions only return values if the strand column is '+' or '-', so errors are raised when no output is produced for these funky chromosome tuples. I think the simplest way around this is to filter out the transcripts with undefined strand.
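For reference, a rough sketch of that filtering idea, done on the raw GTF lines before anything is loaded into PyRanges. This is not the code used in filter_tx_by_intron_chain.py; the function name drop_undefined_strand and the example records are hypothetical, and it only assumes the standard GTF layout where the strand is the 7th tab-separated field.

```python
from typing import Iterable, Iterator

# Strand values that downstream strand-aware functions can handle
VALID_STRANDS = {"+", "-"}


def drop_undefined_strand(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose strand field (column 7) is '+' or '-'.

    Hypothetical helper for illustration only. Comment lines and blank
    lines are passed through unchanged; records with fewer than 7 fields
    are dropped along with records whose strand is undefined (e.g. '.').
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7 and fields[6] in VALID_STRANDS:
            yield line


# Made-up records: T1 is on '+', T2 has an undefined strand ('.')
example = [
    "# merged assembly (made-up example)\n",
    'chr1\texample\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\texample\texon\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr2\texample\ttranscript\t200\t900\t.\t.\t.\tgene_id "G2"; transcript_id "T2";\n',
    'chr2\texample\texon\t200\t900\t.\t.\t.\tgene_id "G2"; transcript_id "T2";\n',
]

kept = list(drop_undefined_strand(example))
```

Filtering at the line level like this assumes every record of a transcript carries the same strand value, so removing the '.' records removes the whole transcript. If the GTF is already loaded, the equivalent would be a filter on the strand column before calling the strand-aware functions.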