gagneurlab / MMSplice_MTSplice

Tissue-specific variant effect predictions on splicing
MIT License

pyranges.df is an expensive operation, best to avoid? #18

Closed. endrebak closed this issue 2 years ago

endrebak commented 5 years ago

A PyRanges object is a collection of dataframes, one per chromosome (or per chromosome/strand pair). When you call gr.df, it concatenates those dataframes into one. This can be slow and memory-consuming, especially if you then build a PyRanges from the result, because the df has to be split on chromosome/strand again :)

If the files are potentially large, instead of doing

df = gr.df
df = do_stuff_to_df(df)
gr = pr.PyRanges(df)

you should consider

gr = gr.apply(do_stuff_to_df)

Congrats on the publication, btw. PyRanges was accepted with a minor revision in Bioinformatics, so be sure to cite it next time :)

endrebak commented 5 years ago

You can also do

new_pyranges = {}
for k, df in gr:
    new_pyranges[k] = do_stuff_to_df(df)

gr = pr.PyRanges(new_pyranges)

This avoids the concatenate-then-split round trip and is arguably more Pythonic, since it relies on plain iteration rather than on applying functions.
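To make the cost difference concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not PyRanges itself: lists of row dicts play the role of the per-chromosome/strand dataframes, and `add_length`, `round_trip`, and `per_chunk` are hypothetical names made up for the illustration. The round trip builds one big flat table and then regroups it, while the per-chunk version touches each table once and gives the same result.

```python
from collections import defaultdict

# Toy stand-in for a PyRanges object: one table per (chromosome, strand) key.
# Each "table" is a list of row dicts here; in PyRanges it would be a DataFrame.
chunks = {
    ("chr1", "+"): [{"Start": 10, "End": 20}, {"Start": 30, "End": 45}],
    ("chr1", "-"): [{"Start": 5, "End": 9}],
    ("chr2", "+"): [{"Start": 100, "End": 150}],
}


def add_length(rows):
    """Hypothetical do_stuff_to_df: return new rows with a Length column."""
    return [{**row, "Length": row["End"] - row["Start"]} for row in rows]


def round_trip(chunks, func):
    """Mimic gr.df -> func -> pr.PyRanges(df): concatenate, transform, re-split."""
    # Step 1: concatenate every chunk into one flat table (a full extra copy).
    flat = [
        {"Chromosome": chrom, "Strand": strand, **row}
        for (chrom, strand), rows in chunks.items()
        for row in rows
    ]
    # Step 2: transform the flat table.
    flat = func(flat)
    # Step 3: split on chromosome/strand again to rebuild the collection.
    regrouped = defaultdict(list)
    for row in flat:
        row = dict(row)
        key = (row.pop("Chromosome"), row.pop("Strand"))
        regrouped[key].append(row)
    return dict(regrouped)


def per_chunk(chunks, func):
    """Mimic gr.apply(func) or the for-loop: transform each chunk in place of the round trip."""
    return {key: func(rows) for key, rows in chunks.items()}


via_round_trip = round_trip(chunks, add_length)
via_per_chunk = per_chunk(chunks, add_length)
```

Both paths return the same per-key tables; the per-chunk path simply skips steps 1 and 3. This only holds when the function works on each chromosome/strand table independently, which is an assumption of the sketch.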

s6juncheng commented 5 years ago

Hi @endrebak, congrats on publishing pyranges, it's really a great tool! Thanks for the suggestions, we will incorporate them in the next step.