EMS-TU-Ilmenau / fastmat

A library to build up lazily evaluated expressions of linear transforms for efficient scientific computing.
https://fastmat.readthedocs.io
Apache License 2.0

Storing and loading fastmat matrices #9

Closed fabiankrieg closed 6 years ago

fabiankrieg commented 7 years ago

Feature request

Problem

Fastmat offers a nice set of features for efficiently handling structured, sparse, and other kinds of matrices. Some users might build fairly advanced matrices that take time to compute, using the various fastmat classes as containers to allow fast products. Storing these to disk for later use is not straightforward.

Solution

I did some research on the topic but have not found a thorough solution yet.

First idea: Make fastmat Pickle-able

As a Python newbie, I was also new to pickle. I learned that pickle allows fairly convenient serialization of Python objects, e.g. for file I/O. I also put together a small example, which was easy to implement. Consider a class like this:

import fastmat

class SomeBlockMatrix(fastmat.Matrix):
    def __init__(self, items):
        # *items is a list of matrices (fastmat, numpy, scipy sparse) 
        # that is accessed for calculation of products
        self._items = items
        [...]

Now we use it as follows:

import numpy
import scipy.sparse

a = numpy.random.randn(10, 20)
b = scipy.sparse.rand(10, 20)

A = SomeBlockMatrix([a, b, a])
B = SomeBlockMatrix([A, A])

But how do we store it to disk? We have to tell pickle how to pickle it, which means providing a __reduce__() method for SomeBlockMatrix. This method returns the class itself, so that pickle can instantiate a new object of that class upon loading. It also returns a tuple of arguments that are passed to the constructor, so that pickle initializes an object with the same content:

class SomeBlockMatrix(fastmat.Matrix):
    def __init__(self, items):
        # *items is a list of matrices (fastmat, numpy, scipy sparse) 
        # that is accessed for calculation of products
        self._items = items
        [...]

    # tell pickle how to pickle
    def __reduce__(self):
        # first argument: class
        # second argument: tuple of stuff required by constructor
        # reference: 
        # https://stackoverflow.com/questions/19855156/whats-the-exact-usage-of-reduce-in-pickler
        # note the trailing comma: the arguments must be a tuple
        return (self.__class__, (self._items,))

This pretty much did it. We can now write this to disk and load it again, provided every item is itself picklable:

import pickle
filename = 'test.mat'

# store to disk
with open(filename, 'wb') as f:
    pickle.dump([A, B], f)

# load from disk
with open(filename, 'rb') as f:
    C, D = pickle.load(f)

# C and D are now reconstructed copies of A and B
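
For reference, here is a minimal self-contained sketch of the same round trip that runs without fastmat. BlockContainer is a hypothetical stand-in for SomeBlockMatrix (it does not derive from fastmat.Matrix), and an in-memory buffer replaces the file, so this only illustrates the __reduce__() mechanism and says nothing about how fastmat behaves:

```python
import io
import pickle


class BlockContainer:
    """Hypothetical stand-in for SomeBlockMatrix (not a fastmat class)."""

    def __init__(self, items):
        self._items = items

    def __reduce__(self):
        # (callable, args): on loading, pickle calls BlockContainer(items).
        # The arguments must be a tuple, hence the trailing comma.
        return (self.__class__, (self._items,))


inner = BlockContainer([[1.0, 2.0], [3.0, 4.0]])
outer = BlockContainer([inner, inner])

# Store to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump([inner, outer], buffer)

# Load it back.
buffer.seek(0)
loaded_inner, loaded_outer = pickle.load(buffer)
```

Since the stand-in class defines no __eq__(), the loaded objects should be compared by their contents (e.g. loaded_inner._items) rather than with ==.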

Note

When I tried to pickle Cython objects such as fastmat matrices, which have no pickle interface yet, I always ran into segfaults. There was no warning message like the one that occurs for pure Python objects.

Dill instead of Pickle

https://pypi.python.org/pypi/dill

dill extends python’s pickle module

I got some IOErrors when I called my pickling function to save a file from a different module than the one the load function resided in. The corresponding module was not found. There are some hints, e.g. in the discussion at https://stackoverflow.com/questions/2121874/python-pickling-after-changing-a-modules-directory, that this might not happen with dill, as it serializes the objects directly. I have not tested this, though.
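
One way to see why the module has to be findable at load time: pickle stores classes by reference (module name and class name), not by value. The following is a minimal sketch using only the standard library. Payload is a hypothetical stand-in class, and deleting its name merely simulates a loading process that cannot find the class. It does not reproduce the exact error reported above.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a user-defined matrix class."""

    def __init__(self, value):
        self.value = value


blob = pickle.dumps(Payload(3))

# The pickle stores the module and class *names*, not the class body.
stores_module_name = Payload.__module__.encode() in blob
stores_class_name = b"Payload" in blob

# Simulate a loading process in which the class cannot be found.
saved_class = Payload
del Payload
try:
    pickle.loads(blob)
    load_error = None
except (AttributeError, ImportError) as exc:
    load_error = exc
finally:
    Payload = saved_class  # restore the name

# With the class available again, loading works.
restored = pickle.loads(blob)
```

Here load_error holds the exception raised while the class was unavailable, and restored is a regular Payload instance again.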

Further reads

Numpy is much faster at storing/loading matrices than pickle: https://github.com/mverleg/array_storage_benchmark

Security issues of pickle: https://www.synopsys.com/blogs/software-security/python-pickling/

More on Dill vs. pickle: https://stackoverflow.com/questions/33968685/pickle-yet-another-importerror-no-module-named-my-module

Harsh corner cases when pickling on Linux and unpickling on Windows: https://github.com/uqfoundation/dill/issues/218

ChristophWWagner commented 6 years ago

Insight of the day:

It seems the only member that Cython auto-pickling keeps failing on is Matrix._info, which is a packed INFO_ARR_s. Its purpose is to provide a place to store general information (shape, type, and the like) in a way that subclassing won't interfere with by duplicating these entries for each subclass. Currently, I am thinking of lifting that struct to a class of its own and then implementing pickling for that class. However, this could lose the benefit of near-zero-cost access to exactly that general information. One also has to proceed with caution, as _info also holds data type pointers that are only valid for one session. But that would be a matter for that particular pickling routine, wouldn't it?
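
To illustrate how a dedicated pickling routine could deal with session-only data, here is a rough pure-Python sketch. MatrixInfo, its fields, and _resolve_handle() are hypothetical names chosen for illustration and are not fastmat's actual implementation. The idea is simply to drop the session-bound handle in __getstate__() and rebuild it in __setstate__():

```python
import pickle


class MatrixInfo:
    """Hypothetical info record (not fastmat's actual type)."""

    def __init__(self, shape, dtype_name):
        self.shape = shape
        self.dtype_name = dtype_name
        # Stand-in for a pointer that is only valid in this session.
        self._handle = self._resolve_handle()

    def _resolve_handle(self):
        # id() is only meaningful within the current process, like a pointer.
        return id(self.dtype_name)

    def __getstate__(self):
        # Persist everything except the session-only handle.
        state = self.__dict__.copy()
        del state["_handle"]
        return state

    def __setstate__(self, state):
        # Restore the persistent fields, then rebuild the handle.
        self.__dict__.update(state)
        self._handle = self._resolve_handle()


info = MatrixInfo((10, 20), "float64")
restored = pickle.loads(pickle.dumps(info))
```

The general information survives the round trip, while the handle is recomputed in the loading session instead of being read from the file.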

tl;dr: looks like the rabbit hole is not as deep as it seemed.

Maybe this enhancement would even come with the benefit of not having to deal with subclasses, as long as they only store basic (or picklable) data types. Hooray! A toast to the Cython guys, you rock!

ChristophWWagner commented 6 years ago

Newsflash! Pickle support for fastmat classes was just introduced in 0.1.2. Could you please check whether everything works for the use cases you described and let me know if any issues remain with the current implementation? Please note that you need cython>=0.26 installed for pickling to work.