pyranges.df is an expensive operation, best to avoid?

endrebak commented 5 years ago

A PyRanges object is a collection of dataframes, one per chromosome (or per chromosome/strand pair). When you do gr.df, it concatenates those dataframes into one. This can be slow and memory-consuming, especially if you then make a PyRanges from the result, because the df has to be split on chromosome/strand again :)

If the files are potentially large, instead of doing

df = gr.df
df = do_stuff_to_df(df)
gr = pr.PyRanges(df)

you should consider

gr = gr.apply(do_stuff_to_df)

Congrats on the publication btw. PyRanges was accepted with a minor revision in Bioinformatics, so be sure to cite it next time :)

endrebak commented 5 years ago

You can also do

new_pyranges = {}
for k, df in gr:
    new_pyranges[k] = do_stuff_to_df(df)

gr = pr.PyRanges(new_pyranges)

This also avoids the concatenate-and-split round trip, and is arguably more pythonic since it relies on plain iteration rather than applying functions.
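
To make the cost argument concrete, here is a minimal pure-Python sketch of the two paths. It does not use pyranges or pandas. The dict of row lists is only a toy stand-in for the per-chromosome/strand dataframes, and `add_length`, `round_trip` and `per_group` are hypothetical names made up for illustration. Under those assumptions, both paths give the same result, but only the first one builds a merged table and then re-splits it.

```python
from collections import defaultdict

# Toy stand-in for a PyRanges object: one list of rows per (chromosome, strand).
# This is NOT the pyranges API, only a model of the layout described above.
groups = {
    ("chr1", "+"): [{"Chromosome": "chr1", "Strand": "+", "Start": 10, "End": 20}],
    ("chr1", "-"): [{"Chromosome": "chr1", "Strand": "-", "Start": 5, "End": 9}],
    ("chr2", "+"): [
        {"Chromosome": "chr2", "Strand": "+", "Start": 1, "End": 4},
        {"Chromosome": "chr2", "Strand": "+", "Start": 30, "End": 42},
    ],
}


def add_length(rows):
    """Hypothetical transformation, playing the role of do_stuff_to_df."""
    return [{**row, "Length": row["End"] - row["Start"]} for row in rows]


def round_trip(groups, func):
    """Concatenate all tables, transform, then split by key again."""
    # Analogous to gr.df: every row is copied into one big table.
    merged = [row for rows in groups.values() for row in rows]
    transformed = func(merged)
    # Analogous to pr.PyRanges(df): rows are regrouped by chromosome/strand.
    regrouped = defaultdict(list)
    for row in transformed:
        regrouped[(row["Chromosome"], row["Strand"])].append(row)
    return dict(regrouped)


def per_group(groups, func):
    """Transform each table where it already lives; no merge, no re-split."""
    return {key: func(rows) for key, rows in groups.items()}


via_round_trip = round_trip(groups, add_length)
via_per_group = per_group(groups, add_length)
```

The equivalence only holds when the function works row by row (or at least within a single chromosome/strand). A function that needs to see all chromosomes at once, such as a global ranking, would still need the concatenated table.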

s6juncheng commented 5 years ago

Hi @endrebak, congrats on publishing pyranges, it's really a great tool! Thanks for the suggestions, we will incorporate them in the next step.

gagneurlab / MMSplice_MTSplice

pyranges.df is an expensive operation, best to avoid? #18