aidenlab / straw

Extract data quickly from Juicebox via straw
MIT License

python strawC should report data in a more usable format #50

Open ChenfuShi opened 4 years ago

ChenfuShi commented 4 years ago

Hello, sorry if there is an easier way to extract data that I haven't seen, but:

Is your feature request related to a problem? Please describe.

The way strawC currently reports data requires heavy conversion before it is useful. The normal straw returns a list of lists, whereas strawC returns objects that can't be accessed easily. The extraction itself is many times faster than in the normal version, but the added overhead of converting the data makes the overall run the same speed as normal straw, or slower.

%%timeit
data = strawC.strawC('NONE', hic_folder+files[1], 'chr22', 'chr22', 'BP', 10000)
extract = lambda x: (x.binX, x.binY, x.counts)
converted_data = np.array(list(map(extract, data)), dtype = np.int64)
matrix = scipy.sparse.coo_matrix((converted_data[:,2],(converted_data[:,0]//10000,converted_data[:,1]//10000)))

707 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
data = straw.straw('NONE', hic_folder+files[1], 'chr22', 'chr22', 'BP', 10000)
matrix = scipy.sparse.coo_matrix((data[2],(np.array(data[0])//10000,np.array(data[1])//10000)))

673 ms ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Describe the solution you'd like

Is it possible to report the data either like the normal straw, as a numpy array, or even directly as a scipy sparse matrix? If I understand correctly, it is possible to use numpy structures from C++ through pybind, so maybe a version designed like that?
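To make the requested output shape concrete, here is a minimal sketch of the row, column, and count arrays such a version might hand back. It does not call strawC: `ContactRecord` is a hypothetical stand-in for the returned objects (only the attribute names `binX`, `binY`, and `counts` come from the snippet above), `records_to_triplets` is an illustrative helper name, and the records are assumed to arrive as a sequence that supports `len()`.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the record objects strawC returns; only the
# attribute names binX, binY and counts are taken from the post above.
ContactRecord = namedtuple("ContactRecord", ["binX", "binY", "counts"])


def records_to_triplets(records, resolution):
    """Convert record objects to (rows, cols, counts) NumPy arrays.

    Assumes `records` is a sequence (supports len()) and that bin
    positions are multiples of `resolution`, as in the BP example above.
    """
    n = len(records)
    # np.fromiter fills a preallocated 1-D array directly from a generator,
    # avoiding an intermediate list of tuples.
    bin_x = np.fromiter((r.binX for r in records), dtype=np.int64, count=n)
    bin_y = np.fromiter((r.binY for r in records), dtype=np.int64, count=n)
    counts = np.fromiter((r.counts for r in records), dtype=np.float64, count=n)
    # Integer-divide genomic positions by the resolution to get matrix indices.
    return bin_x // resolution, bin_y // resolution, counts


records = [
    ContactRecord(0, 0, 5.0),
    ContactRecord(0, 10000, 2.0),
    ContactRecord(10000, 20000, 1.0),
]
rows, cols, counts = records_to_triplets(records, 10000)
```

The three arrays can be passed to `scipy.sparse.coo_matrix((counts, (rows, cols)))`, as in the timing snippets above. Whether this is faster than the `map`-based conversion has not been measured here; the point is only to show the target format.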

Thanks!

nchernia commented 4 years ago

This is a very good idea. It should be possible to extract as a numpy array or scipy sparse. We probably won't be able to get to this for a few weeks and would welcome any contributions from the community.

