esa / neuralg

Neural network approximators of linear algebra operations on GPU with PyTorch
GNU General Public License v3.0

Create a baseline and minimal working prototype #3

Closed gomezzz closed 2 years ago

gomezzz commented 2 years ago

Once we decide on the problem, we should try to find a minimal subproblem to start with (e.g. matrix inversion, starting with 3x3 real matrices). Then we can build a prototype notebook for it, and from that decide how to turn it into a nice module.

If we already have ideas about it, we might also consider different matrix sizes, etc.

toveag commented 2 years ago

Current train of thought, largely independent of the choice of problem. As of now, I have three rough approach ideas:

  1. Try a "neural-dynamic" system approach based on the existing work in the related literature. We formulate the problem as a system of ODEs representing a (time-continuous) Recurrent Neural Network; the solution is the steady state/equilibrium of the network. If I have understood it correctly, this is strictly speaking not a supervised learning problem, as the connection weights in the network are uniquely defined by the input matrix. The advantage is that there are already some results on it. Disadvantages/challenges include finding a natural way to interface it with PyTorch, and that it is not really state-of-the-art.
  2. The "naive" approach of generating a training data set of examples (i.e. matrices A_i) and ground truths, e.g. inv(A_i); a minimal data-generation sketch follows this list. The initial challenges are choosing an appropriate representation of the data (i.e. should it be in matrix, vector or some block/sequential form) and finding/motivating a network architecture in a semi-rigorous way rather than through trial and error. Also, as mentioned, constructing a smart training set.
  3. A more "algorithm approximation" approach, so to speak. I thought of treating the iterates from a standard numerical method (power iterations for eigenvectors or Newton iterations for inversion, say) as sequential data. The task is then, given X_0, to predict X_1, ..., X_k, and we take the last iterate as our approximation. Such iteration formulas are quite simple functions of A, like polynomials, and intuitively perhaps easier to learn than e.g. the mapping A -> A^-1. Not sure if it makes sense in practice, and it could be inefficient. Although, it might be interesting to try with newer methods, some RNN variant like an LSTM or even a transformer.
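A minimal sketch of what the data generation for the naive approach (2.) could look like, assuming 3x3 real matrices as in the opening comment; the sample count and the identity shift used to keep the matrices well-conditioned are illustrative assumptions, not decisions from this thread:

```python
import torch

def generate_inversion_dataset(n_samples: int = 1000, dim: int = 3):
    # Random matrices; adding dim * I pushes them away from singularity
    # so the ground-truth inverses stay well-behaved.
    A = torch.randn(n_samples, dim, dim) + dim * torch.eye(dim)
    A_inv = torch.linalg.inv(A)
    # Flatten to vectors in case the network consumes flat inputs/outputs.
    return A.reshape(n_samples, -1), A_inv.reshape(n_samples, -1)

X, y = generate_inversion_dataset()
print(X.shape, y.shape)  # torch.Size([1000, 9]) torch.Size([1000, 9])
```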
gomezzz commented 2 years ago

Let's start with the naive approach first and see where it goes.

Next steps:

gomezzz commented 2 years ago

Start with a single matrix and overfit on it, to see if you can invert that one.

Then move to a few different ones, to see if you can fit those, and so on...
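A rough sketch of what this first overfitting experiment could look like: fit a small MLP on a single fixed 3x3 matrix and check whether it reproduces that matrix's inverse. The architecture, optimizer, and hyperparameters below are placeholders, not the ones used in the actual prototype.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 3
A = torch.randn(dim, dim) + dim * torch.eye(dim)   # a single well-conditioned matrix
target = torch.linalg.inv(A).reshape(1, -1)        # ground-truth inverse, flattened
x = A.reshape(1, -1)

model = nn.Sequential(
    nn.Linear(dim * dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, dim * dim),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()

# Sanity check: A @ predicted_inverse should be close to the identity.
pred_inv = model(x).reshape(dim, dim)
print((A @ pred_inv - torch.eye(dim)).abs().max())
```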

gomezzz commented 2 years ago

Maybe aiming for a decomposition could be more tractable?

toveag commented 2 years ago

Progress so far: the pipeline seems okay. I've managed to (over-)fit a 6-layer ReLU MLP to 1000 matrices with some sort of accuracy; the off-diagonal elements of X * X_inv_pred are approximately 10^-3. For fewer matrices it's smaller, obviously. With 8 layers, I get reasonable yet not satisfactory results when trained on 10000 matrices: off-diagonals of at most 10^-2, and the diagonal is close to 1. All of this is for matrices that are similar; they have eigenvalues close to each other and are simultaneously diagonalizable, so they have a pretty homogeneous structure.
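For reference, a sketch of how the accuracy check described above could be computed; the function name, tensor names, and batch layout are illustrative assumptions, not the prototype's actual code.

```python
import torch

def identity_deviation(X: torch.Tensor, X_inv_pred: torch.Tensor):
    """X, X_inv_pred: batches of shape (n, d, d). Returns the max absolute
    deviation of X @ X_inv_pred from the identity, split into diagonal and
    off-diagonal parts."""
    prod = X @ X_inv_pred
    d = prod.shape[-1]
    eye = torch.eye(d, dtype=prod.dtype, device=prod.device)
    off_diag_mask = ~eye.bool()
    off_diag_err = prod[:, off_diag_mask].abs().max()
    diag_err = (prod.diagonal(dim1=-2, dim2=-1) - 1).abs().max()
    return diag_err, off_diag_err
```

Used on a batch of test matrices and the corresponding network predictions (reshaped back to (n, d, d)), this reports the kind of diagonal/off-diagonal numbers quoted above.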

The model actually seems to generalize a bit to new matrices, at least when they are generated via the same function. Emphasizing that the effect is small and not very stable, but it indicates that the model is indeed learning something. This perhaps implies that a first application could be in a specialized setting where the matrices share a known, similar structure! Steps forward:

toveag commented 2 years ago