This library implements methods to facilitate minimally lossy matrix decomposition for sparse nonnegative matrices, under the paradigm described in Saul (2022).
In the problem setting, given a sparse nonnegative matrix X, we would like to return a low-rank matrix L of known target rank r. Applying a ReLU nonlinearity to L allows lossless recovery of X.
Four methods of estimating the low-rank representation L are currently offered:
All methods operate in an iterative fashion; the model-based methods are particularly analogous to expectation-maximization, in that they iteratively refine a model's parameters and recompute the posterior probability under the new parameters.
The library is accessed through Python calls (interactive REPL and scripts have been tested; Jupyter has not, but should work).
Only numpy is used extensively, but the other libraries offer more convenient implementations of some of the statistical operations.
As the fi_nomad
package is not yet published, you'll need to install it in local mode. The easiest way to do this is to:
pip install -e .
to make fi_nomad
an installed package for the environmentWe'll publish to pypi
once we're out of alpha and have picked a good name.
Load the observed non-negative matrix X
as a numpy array:
from fi_nomad import decompose
from fi_nomad.types import KernelStrategy
import numpy as np
import logging
# "info"-level log messages in the library won't be displayed unless the
# caller explicitly allows them--this is by design, since it's bad to let
# library code override its caller's logging strategy!
logging.basicConfig(level=logging.INFO)
# NOTE: Ensure that this uses a float dtype!
# If you try to use this method on an integer array, it
# isn't going to work very well!
nonnegative_matrix_X = np.array([...]) # or load from file, etc.
target_rank = 5 # as per your domain expertise
result_data = decompose(
nonnegative_matrix_X,
target_rank,
kernel_strategy=KernelStrategy.GAUSSIAN_MODEL_SINGLE_VARIANCE,
verbose=True
)
model_means_L = result_data.factors[0] @ result_data.factors[1]
model_variance = result_data.variance
# use model_means_L as appropriate for your application
# To visualize recovery of X, assuming target_rank was high enough to do so:
# First improve readability of printed numpy arrays:
np.set_printoptions(precision=5, linewidth=150)
# then pass the low-rank estimate through a ReLU nonlinearity and compare
# its result to the input sparse matrix:
relu_L = np.copy(model_means_L)
relu_L[relu_L < 0] = 0
print(relu_L)
The main entry point for the model-based low-rank matrix estimation is fi_nomad.decompose
.
Three parameters are required:
KernelStrategy.BASE_MODEL_FREE
-- a naive approach that just iteratively applies SVD, with no
underlying statistical modelKernelStrategy.GAUSSIAN_MODEL_SINGLE_VARIANCE
-- a Gaussian model as described in Saul (2022)KernelStrategy.GAUSSIAN_MODEL_ROWWISE_VARIANCE
-- a Gaussian model similar to that described in
Saul (2022), but which computes a different variance value for each row instead of using a single
global/mean varianceKernelStrategy.MOMENTUM_3_BLOCK_MODEL_FREE
-- an extension of the base model-free algorithm,
which uses the parametrization $L=WH$ to separately update $W$ and $H$, using matrix least squares, and extrapolating $Z$ and $L$ with momentum terms.Additionally, the following options are exposed:
svd_strategy
: Strategy to use for SVD step during iteration. Supported options:
SVDStrategy.FULL
: Full, deterministic decomposition.SVDStrategy.EXACT_TRUNCATED
: Deterministic/exact decomposition, but will not attempt to recover beyond target_rank
values.SVDStrategy.RANDOM_TRUNCATED
: Uses a randomized algorithm for truncated SVD. The default, with much better performance.initialization
: Initialization strategy for the first guess at the low-rank matrix.
InitializationStrategy.BROADCAST_MEAN
: the default for the model-based algorithm; creates a first-guess low-rank matrix
where each element is the mean of the elements of X
.InitializationStrategy.ROWWISE_MEAN
: creates a first-guess low-rank matrix where each element is the mean of the
elements of X
from the corresponding row. This is intuitively appealing for the per-row-variance Gaussian model algorithm.InitializationStrategy.COPY
: Uses a copy of X
as the first guess for the low-rank matrix. The naive EM algorithm is
currently restricted to this, though that may change in the future.InitializationStrategy.KNOWN_MATRIX
: When used along with a value for the initial_guess_matrix
parameter,
allows the caller to use any appropriately-sized matrix for the current low-rank estimate. (This is to facilitate
checkpointing and warm starts.)tolerance
: If set, the algorithm will stop early once the loss (defined as the Frobenius norm of the
difference between X
and the current estimate) is below this value.manual_max_iterations
: If set, the algorithm will use this number as the maximum number of iterations
(instead of running for 100 * target_rank iterations).verbose
: if set to True
(and the caller's logging config allows it), will log a running record of estimation performance.diagnostic_config
: Optional. If set, it should contain a DiagnosticDataConfig
data object including
the level of per-iteration diagnostic information to record and the base path configuration for this
output. This will only be useful when working with specific kernels that implement per-iteration
diagnostic data output.Regardless of settings, the estimator will generate a warning if the iteration-over-iteration likelihood was observed to decrease during the course of estimation. (This is quite common due to numerics noise once a good estimate has been reached.)
Lawrence K Saul (2022), "A Nonlinear Matrix Decomposition for Mining the Zeros of Sparse Data" https://doi.org/10.1137/21M1405769 (Preprint: https://cseweb.ucsd.edu/~saul/papers/preprints/simods22_preprint.pdf)
Seraghiti, G., et. al. (2023), "Accelerated Algorithms for Nonlinear Matrix Decomposition with the ReLU Function" https://arxiv.org/abs/2305.08687