Add option to perform sparse operations and use sparse matrices in optimizer

eloygeenjaar commented 2 years ago

Is your feature request related to a problem? Please describe.

In a project, I have found the memory usage of the optimizer to grow exponentially with the number of features and parameters (parameters like the degree of the polynomials I use in the Polynomial Library).

Describe the solution you'd like

I have already created a fork and swapped out some of the numpy operations in the Constrained SR3 optimizer with scipy.sparse operations. For the Constrained SR3 case, the np.kron() operation on line 219 can especially lead to large memory demands. Further, since most of the matrices in the optimizer (and I assume in the other optimizers as well) are highly sparse, it is possible to reduce memory demands significantly. I have noticed that the memory usage has gone down from about 64-512GB RAM to being able to run the code on my laptop, by replacing the operations with sparse operations. Not only that, but the code seems to run roughly 2 to 3 magnitudes faster. I think it may be possible to do this for more optimizers and make this a standard option during training, note that in my fork I have added an argument to the Constrained SR3: sparsity that you can set to True to train the model with sparse matrices and operations. My current implementation is far from perfect, but I wanted to alert you to the possibility of speeding up the code and decreasing memory usage, which may open up research with larger feature sets.

Describe alternatives you've considered

-

Additional context

This is my fork pysindy_sparse Note that I have done a quick test to compare the two methods, from my own experience the timing differences scale really quickly and so does the memory usage. For example, with around ~250 features I need to allocate 64GB RAM (for polynomial degree=1) and 128GB RAM (for polynomial degree=2) on my cluster to run similar code for a project. I can now run this with polynomial degree=3 on my personal laptop without any memory problems. This is the code for each setup, the first code snippet runs my first naive implementation and the second the current pysindy repo.

sys.path.append("/Users/egeenjaar/pysindy_sparse")
import pysindy as ps
import numpy as np
import time

np.random.seed(42)
data = np.random.rand(1200, 60).astype(np.float32)
library = ps.PolynomialLibrary(include_interaction=False, degree = 2)
library.fit(data)
n_features = library.n_output_features_
lhs = np.zeros((60, 60 * n_features), dtype=np.float32)
for i in range(60):
    lhs[i, i+1+i*n_features]=1
rhs = np.ones((60,), dtype=np.float32)
diff = ps.SmoothedFiniteDifference()
optimizer = ps.ConstrainedSR3(constraint_rhs=rhs, constraint_lhs=lhs, threshold=0.1, max_iter=30, verbose=True,
                                  sparsity=True)
model = ps.SINDy(optimizer=optimizer, feature_library=library, differentiation_method=diff)
start_time = time.perf_counter()
model.fit(data, t=1)
end_time = time.perf_counter()

print(f'Total time: {end_time - start_time}')
print(model.coefficients().toarray())```

**Output**

```Iteration ... |y - Xw|^2 ...  |w-u|^2/v ...       R(u) ... Total Error: |y - Xw|^2 + |w - u|^2 / v + R(u)
         0 ... 5.4201e+02 ... 7.4341e+00 ... 7.2600e+01 ... 6.2205e+02
         3 ... 5.4180e+02 ... 5.6356e+00 ... 7.2600e+01 ... 6.2004e+02
         6 ... 5.4180e+02 ... 5.6361e+00 ... 7.2600e+01 ... 6.2004e+02
         9 ... 5.4180e+02 ... 5.6361e+00 ... 7.2600e+01 ... 6.2004e+02
Total time: 0.9769660609999999
[[-0.24633193  1.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          1.         ...  0.          0.
   0.        ]
 [-0.17271745  0.          0.         ...  0.          0.
   0.        ]
 ...
 [-0.2733521   0.          0.         ... -0.94027844  0.
   0.        ]
 [ 0.          0.          0.         ...  0.         -0.94705056
   0.        ]
 [-0.17063552  0.          0.         ...  0.          0.
  -0.93824286]]

Current repo

sys.path.append("/Users/egeenjaar/pysindy")
import pysindy as ps
import numpy as np
import time

np.random.seed(42)
data = np.random.rand(1200, 60).astype(np.float32)
library = ps.PolynomialLibrary(include_interaction=False, degree = 2)
library.fit(data)
n_features = library.n_output_features_
lhs = np.zeros((60, 60 * n_features), dtype=np.float32)
for i in range(60):
    lhs[i, i+1+i*n_features]=1
rhs = np.ones((60,), dtype=np.float32)
diff = ps.SmoothedFiniteDifference()
optimizer = ps.ConstrainedSR3(constraint_rhs=rhs, constraint_lhs=lhs, threshold=0.1, max_iter=30, verbose=True)
model = ps.SINDy(optimizer=optimizer, feature_library=library, differentiation_method=diff)
start_time = time.perf_counter()
model.fit(data, t=1)
end_time = time.perf_counter()

print(f'Total time: {end_time - start_time}')
print(model.coefficients())```

**Output**
``` Iteration ... |y - Xw|^2 ...  |w-u|^2/v ...       R(u) ... Total Error: |y - Xw|^2 + |w - u|^2 / v + R(u)
         0 ... 5.4201e+02 ... 7.4340e+00 ... 7.2600e+01 ... 6.2205e+02
         3 ... 5.4180e+02 ... 5.6355e+00 ... 7.2600e+01 ... 6.2004e+02
         6 ... 5.4180e+02 ... 5.6360e+00 ... 7.2600e+01 ... 6.2004e+02
         9 ... 5.4180e+02 ... 5.6361e+00 ... 7.2600e+01 ... 6.2004e+02
Total time: 8.805510169
[[-0.24639121  1.          0.         ...  0.         -0.
   0.        ]
 [-0.          0.          1.         ...  0.         -0.
  -0.        ]
 [-0.17289819  0.          0.         ...  0.         -0.
  -0.        ]
 ...
 [-0.27381932 -0.         -0.         ... -0.94027504 -0.
  -0.        ]
 [-0.          0.         -0.         ... -0.         -0.94705407
   0.        ]
 [-0.17056681 -0.          0.         ...  0.         -0.
  -0.9382435 ]]

As you can see the coefficient matrices look the same (this needs further testing) and the timings are 0.97s vs 8.8s (I have noticed between a 20-100x speedup with ~250 features, but haven’t formally tested it because it takes very long to run). My current code is not at all ready to merge, because I first wanted to bring up this issue + discuss possible ways to integrate this into your codebase and/or other optimizers.

Running the same code with 100 features, I get the following timings: pysindy_sparse = 3.7579019259999997 (this is around 4 seconds on average) psyindy = 88.883578903s

Kind regards,

Eloy Geenjaar

akaptano commented 2 years ago

Nice job with this (also thank for you being thorough!) and I think you're right that all the optimizers would get several speedup factors from this changes. I will put it on my to-do list. In principle this would be a series of relatively simple changes of the code but would require a lot of testing to make sure we don't screw up each optimizer.

eloygeenjaar commented 2 years ago

Yes I agree that it needs a lot of testing. That's why I didn't want to dive too deep into creating all the code before mentioning it, especially because there are multiple design choices to be made regarding integrating this into the codebase. I would be happy to help if you would like.

P.S. note that I personally think the memory reduction is the largest improvement. Let's say we have 100 features and degree=2 for the polynomial library. In the _update_full_coef_constraints() function in the ConstrainedSR3 class, the sizes of the variables are: inv1.shape = (201, 201), H.shape = (201, 201), coef_sparse.shape = (201, 100). Then in the np.kron() operation, the output dimensions will be (20100, 20100). Assuming float32, this would be a 20100 20100 4 bytes = 1.16GB matrix, although most of the entries are zero. So for my example code, where I don't model the interactions in the polynomial terms, this scales by: ((degree input_features + 1) input_features) ^2. Using 200 features with the same number of degrees in the polynomial terms leads to a 25.6GB matrix, etc. This is just the output of the kron() operation that needs to (temporarily) be kept in memory, together with the input data and possibly other large sparse matrices (left-hand-side constraints = (input_features, (degree input_features + 1) input_features)) this may lead to the inability to run SINDy on datasets with >200 input features without the use of a sizeable cluster. I hope this enhancement can help make that more accessible to people working with smaller clusters and/or large feature sets.

znicolaou commented 2 years ago

Looks interesting, thanks @eloygeenjaar! You're certainly right that those kronecker deltas use excessive memory for large systems. I wonder if slicing can be used instead of matrix multiplication to eliminate the need for the kroneckers entirely. In any case, it will be good to check the optimizers for sparse operations like you suggest. I'll try to check out your code and the other optimizers some time soon.

eloygeenjaar commented 2 years ago

@znicolaou Yes I fully agree with you! I initially looked at trying to simplify the Kronecker product and I think that would make the enhancement I propose even better! The scipy.sparse matrices and operations were just an easier fix for my specific problem, but I fully agree. Note there are some options to remove the Kronecker product, such as tensordot. I can't directly find an equivalent operation in scipy.sparse, but there is probably a way to perform an equivalent operation as tensordor with sparse matrices.

akaptano commented 2 years ago

@znicolaou @eloygeenjaar A word of caution for altering these algorithms. One issue that comes up is that some of the operations (especially in the constrained and trapping SR3 optimizers) are dependent on a SVD (importantly, the pinv function). Often these inverse operations are performed on sparse matrices (the model coefficients or the constraints), meaning the matrices are often very ill-conditioned. For functions like pinv, there is a threshold value "rcond" such that singular values < rcond get truncated to zero. In some situations this can change the results of the algorithm. In some cases this may make the model fit better or worse, but mostly I wanted to warn you that you may change the function correctly but still produce slightly different models than before!

dynamicslab / pysindy