gallantlab / cottoncandy

sugar for s3
http://gallantlab.github.io/cottoncandy/
BSD 2-Clause "Simplified" License
33 stars 17 forks

Specialized sparse matrix functions? #13

Closed alexhuth closed 8 years ago

alexhuth commented 8 years ago

I'm working on some code that will involve caching a lot of sparse matrices, and I want to do this through cc. There are a couple options for how to handle this, and I want to get others' input.

The easiest way would be to pickle the sparse matrices. This is what numpy.savez does (because it's what that function does with everything that's not a numpy array). But this has downsides: pickle is insecure and super non-portable. A pickled sparse array from one version of scipy might fail to unpickle in another version, creating a real danger of data-rot.

The preferred, portable way to save sparse matrices is to pull out their constituent arrays and save them (e.g. compressed sparse row matrices consist of three ordinary, contiguous arrays: data, indices, and indptr). Then upon load the constituent arrays are reconstituted into the full sparse array. This is fine, but involves boilerplate code. It would be easier for me if this code could be put straight into BasicInterface, but I'm a bit worried that it might be too esoteric.
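A minimal sketch of that decompose/reconstitute round trip, assuming scipy's CSR layout; the helper names `csr_to_dict` and `dict_to_csr` are illustrative, not cottoncandy API:

```python
import numpy as np
import scipy.sparse as sp

def csr_to_dict(mat):
    """Decompose a CSR matrix into plain, contiguous numpy arrays plus its shape."""
    return dict(data=mat.data, indices=mat.indices,
                indptr=mat.indptr, shape=np.asarray(mat.shape))

def dict_to_csr(d):
    """Reassemble the CSR matrix from the arrays saved above."""
    return sp.csr_matrix((d['data'], d['indices'], d['indptr']),
                         shape=tuple(d['shape']))

# Round-trip check on a small example matrix.
dense = np.eye(4)
dense[0, 3] = 5.0
mat = sp.csr_matrix(dense)
roundtrip = dict_to_csr(csr_to_dict(mat))
assert (roundtrip - mat).nnz == 0  # exact round trip, no pickle involved
```

Each of the four saved pieces is an ordinary dense array, so it can go through whatever array upload path already exists.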

So that brings me to my questions:

  1. Would anyone else be interested in having dedicated code for saving & loading sparse matrices in cottoncandy?
  2. What's the general policy on whether specific datatypes should be supported in cc? Is cc going to support saving/loading contiguous numpy-like arrays only, and everything else is a file object, or do we want to eventually include functions for saving and loading other types?
sslivkoff commented 8 years ago

I think this would be a great addition to cottoncandy.

I don't think it would make the CC codebase less pure in any sort of negative way. Numpy and scipy are already intimately related.

1) I don't have any use for sparse matrices at the moment. 2) I think it would be nice to have support for as many filetypes as people are willing to write code for as long as it's done in a clean/maintainable way.

r-b-g-b commented 8 years ago

I agree that it's not too esoteric. I wrote some code that does what you mention, extracting the data, indices, indptr arrays from the sparse matrix object and downloading/uploading them through cloud2dict/dict2cloud. We could even agree on using extensions, e.g. .csr or .coo, to enable auto-detection of which function to use to load the file.
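The extension idea could be as simple as a suffix-to-class lookup on the object name; this dispatch table and the `sparse_class_for` helper are hypothetical, just to show the shape of the convention:

```python
import os
import scipy.sparse as sp

# Map object-name extensions to sparse matrix classes (illustrative only).
SPARSE_CLASSES = {'.csr': sp.csr_matrix, '.csc': sp.csc_matrix}

def sparse_class_for(object_name):
    """Pick the reconstruction class from the object name's extension."""
    _, ext = os.path.splitext(object_name)
    try:
        return SPARSE_CLASSES[ext]
    except KeyError:
        raise ValueError('unknown sparse format: %r' % ext)

print(sparse_class_for('semantic_model/weights.csr'))
```

The loader would then call the right constructor on the arrays it pulled down, without the caller having to say which format was stored.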

alexhuth commented 8 years ago

@r-b-g-b ahh cool, I should have realized. My initial thought was that the easiest thing to do would be to force-convert everything to, e.g., CSR before uploading, and then convert back on download. In my experience the cost of that is not so high (much less than 1 s for the size of matrices I'm working with, 175,000 × 2,500), but your experience may be different.
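The "force everything to CSR" idea is just a pair of format conversions around the transfer; a sketch using scipy's built-in converters (the matrix here is random test data, not anything from the thread):

```python
import scipy.sparse as sp

# Random sparse test matrix in COO format.
coo = sp.random(1000, 500, density=0.01, format='coo', random_state=0)

as_csr = coo.tocsr()      # canonicalize to CSR before upload
back = as_csr.tocoo()     # user converts back to their format on download

# Conversion is lossless: the round trip preserves every entry.
assert (back - coo).nnz == 0
```

For pathological shapes (like the very tall, very sparse matrices mentioned below) the `tocsr()` step itself is what can take a few seconds.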

r-b-g-b commented 8 years ago

Time to convert probably depends a lot on the specifics of the matrix. I know for the 200m × 10k natural language prior matrices with 5 non-zero entries per row, converting can take a few seconds. And how would you specify which type of sparse matrix to return: an additional argument, or have the user take care of it? The object-name-extension way seems cleaner to me, but maybe it's a little too bulky a convention to add to the lithe cottoncandy style.

anwarnunez commented 8 years ago

can we close this? did the recent pull request address everything we wanted?

alexhuth commented 8 years ago

Yup, let's close it! Sorry I forgot.