deeptools / pyBigWig

A python extension for quick access to bigWig and bigBed files
MIT License

Parallelizing pyBigWig.values() #74

Closed: nicolazilio closed this issue 6 years ago

nicolazilio commented 6 years ago

Hi there,

First of all, I'd like to say that pyBigWig and Deeptools are awesome tools. Thanks a lot for creating them.

I have been trying to parallelize pyBigWig.values() with the multiprocessing library without success. Essentially, what I tried to do is something like this:

import pyBigWig
import pandas as pd
from multiprocessing import Pool

bw = pyBigWig.open("/path/to/bigwig.bw")
coordinates = [(0, ["chr1", 120000, 130001]), (1, ["chr3", 160000, 170001])]  # etc.
counts = pd.DataFrame(columns=[i for i in range(10001)])

def extract_data(data):
    index, row = data
    counts.loc[index] = pd.Series(bw.values(row[0], row[1], row[2])).fillna(0).astype(int)

pool = Pool(n)
pool.map(extract_data, coordinates)

This works fine if n = 1, but for n > 1 I receive an error saying that there was a problem getting values.

Any ideas as to how to accomplish this?

Thanks a lot in advance

dpryan79 commented 6 years ago

Open the bigWig file inside extract_data().
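
A minimal sketch of that suggestion (the file path and the (index, ["chrom", start, end]) tuple layout are just the placeholders from the original post):

import pyBigWig

def extract_data(data):
    # open the bigWig inside the worker process instead of in the parent,
    # so each process gets its own file handle
    index, (chrom, start, end) = data
    bw = pyBigWig.open("/path/to/bigwig.bw")
    try:
        values = bw.values(chrom, start, end)
    finally:
        bw.close()
    return index, values

Returning the values instead of assigning into counts also matters: each worker process gets its own copy of the DataFrame, so assignments made in the children never reach the parent.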

nicolazilio commented 6 years ago

I tried that. In principle it works, in the sense that I don't get errors, but the problem is that, with that setup, the computing time increases as I increase the number of processes.

dpryan79 commented 6 years ago

For small regions that are near each other, multiprocessing won't help you. Once you have 100 kb or megabase-sized regions, the overhead of reading and decompressing is no longer rate limiting. In general, opening files inside the worker processes is the only way to reliably access files in parallel with Python.
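
If opening the file on every call is itself a noticeable cost, one standard multiprocessing pattern (not from this thread, just a possible refinement) is to open the bigWig once per worker with a Pool initializer. A sketch, reusing the placeholder path and coordinates from above:

import pyBigWig
from multiprocessing import Pool

bw = None  # per-worker file handle, set by the initializer

def init_worker(path):
    global bw
    bw = pyBigWig.open(path)

def extract_data(data):
    index, (chrom, start, end) = data
    return index, bw.values(chrom, start, end)

if __name__ == "__main__":
    coordinates = [(0, ["chr1", 120000, 130001]), (1, ["chr3", 160000, 170001])]
    with Pool(4, initializer=init_worker, initargs=("/path/to/bigwig.bw",)) as pool:
        results = pool.map(extract_data, coordinates)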

nicolazilio commented 6 years ago

I have done some more research and it seems that, as you pointed out, increasing the number of processes indeed does not help much. However, the biggest reason for the slowdown I was seeing was actually appending new rows to the pandas DataFrame thousands of times. I changed that to writing to a file directly and things improved A LOT.
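
For reference, a sketch of that kind of change (the output filename and the (index, values) result layout are assumptions, not the actual code): collect the pool.map results and write them out once instead of assigning counts.loc[index] thousands of times.

import numpy as np

results = pool.map(extract_data, coordinates)   # list of (index, values) pairs
with open("counts.tsv", "w") as out:
    for index, values in sorted(results):
        # mirror the fillna(0).astype(int) step without growing a DataFrame
        row = np.nan_to_num(np.asarray(values)).astype(int)
        out.write(str(index) + "\t" + "\t".join(map(str, row)) + "\n")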

Thanks again for the help.

dpryan79 commented 6 years ago

Glad you got things resolved!