Testing of initial dataset

rhyshawkins commented 1 year ago

There is now a small dataset (1,000 models and outputs) in the github repository.

The first task is to look at using some simple ML technique to approximately predict the output given an input in a blackbox type of approach.

1,000 models may not be sufficient, but I can generate further as necessary.

amartinhuertas commented 1 year ago

@sghelichkhani @rhyshawkins FYI ...

I talked yesterday with @GiteonCaulfied and I recommended him to start with a Fully Connected Neural Network.

Some decisions to be taken:

Software packages to leverage. @GiteonCaulfied has experience with PyTorch. Perhaps something to explore is Keras, and see whether we can leverage either PyTorch or TensorFlow from a common software interface.
Number of hidden layers and neurons per layer. We can start, e.g., with 2 hidden layers and 10 neurons per layer, and then see from there. But I anticipate that we would need to do a parametric study to see which is the architecture which achieves best trade-off among network expressivity and ease of training.
Activation function. I would start with tanh for all hidden layers, except for the last layer, as I imagine that we do not want to constraint the output to be within an interval [-1,1].
Loss function. MSE + regularization? (see Appendix A 1D Mars paper). Early stopping criterion?
How to split dataset among training, test and validation.
Adimensionalization of model parameters might help ML pipeline. I guess this requires a to regenerate data set.
As @rhyshawkins ... a dataset of 1K might not be enough. We can confirm this if we see overfitting.
What else?

amartinhuertas commented 1 year ago

Software packages to leverage. @GiteonCaulfied has experience with PyTorch. Perhaps something to explore is Keras, and see whether we can leverage either PyTorch or TensorFlow from a common software interface.

Forget about this. As far as I can see, Keras does not have support for PyTorch as backend. Thus, we can stick into PyTorch if @GiteonCaulfied is happy with it.

amartinhuertas commented 1 year ago

@rhyshawkins ... is there any code that we can leverage the visualize on the sphere the spherical harmonics produced by the ML system?

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas @rhyshawkins

I've implemented a naive Fully Connected Neural Network in Pytorch, some information are recorded as follows:

It has 3 hidden layers (first layer has 200 neurons, second one 150 neurons and third one 100 neurons)
Tanh() are applied for all hidden layers, except for the last layer.
Loss function now is purely MSE without regularization and the optimizer here is Adam Optimizer.
Training in 100 epochs and a batch size of 4
Dataset for Training, Testing and Validation is split into a ratio of 8:1:1
Since the range of output vector is huge compared to the range of input vector and the values of the output vector seem to be evenly distributed according to a plot I've printed in the file, a Scaler called MinMaxScaler is used to scale the output data before they are split into 3 datasets. (They will be scale back to normal values when testing)
The result of the training process is made deterministic using a random seed

However, Some problems come up during my implementation:

Accuracy is low, nearly 6 out of 20 predicated output match with the real output from the data provided (there are also some of the predicated output that are a bit close to the ground truth, and some completely different predicated output) Is this underfitting situation happen due to the lack of data? or is it purely because I the model I have now is too naive and a lot of things needs to be improved?
During my implementation, the model constantly only outputs the mean of the total output vectors, which means it decides to make the output vector stay the same instead of adjusting with different inputs, the current model doesn't have this problem but I am still not sure how to avoid this. (This one just got lucky when I am changing some hyperparameters)

Also, another problem that is not relevant to the 1D Geoid problem:

Is it ok to commit my progress using my ANU git account (the one registered with ANU email and used for GitLab, but my ANU email is not registered for a GitHub account yet so you can't click to check its profile, since there are no profile to be checked). You can see that the current commit I've made are under the name of Xuzeng He instead of @GiteonCaulfied , this is because the git account on my laptop is the ANU one and this @GiteonCaulfied is registered with my personal email.

GiteonCaulfied commented 1 year ago

It seems that I just closed the issue as not planned, I'll reopen with this comment since the problems are not solved yet

amartinhuertas commented 1 year ago

but my ANU email is not registered for a GitHub account yet

Can you try to add your ANU email to your github account?

https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-email-preferences/adding-an-email-address-to-your-github-account

amartinhuertas commented 1 year ago

Accuracy is low, nearly 6 out of 20 predicated output match with the real output from the data provided (there are also some of the predicated output that are a bit close to the ground truth, and some completely different predicated output) Is this underfitting situation happen due to the lack of data? or is it purely because I the model I have now is too naive and a lot of things needs to be improved?

I would start at plotting the loss evolution as a function of epoch identifier. Did it achieve a plateau? Which is the accuracy of the trained model with the training set itself (i.e., for 80% of the samples)?

amartinhuertas commented 1 year ago

During my implementation, the model constantly only outputs the mean of the total output vectors, which means it decides to make the output vector stay the same instead of adjusting with different inputs, the current model doesn't have this problem but I am still not sure how to avoid this. (This one just got lucky when I am changing some hyperparameters)

As a general comment, it would be great to design the experiments/code such that we have traceability. In other words, as we test with different values of the hyperparameters, we would like to be able to record the results (e.g., the trained network) so that we dont have to repeat them again/not having to rely on our memory.

amartinhuertas commented 1 year ago

As a general comment, it would be great to design the experiments/code such that we have traceability. In other words, as we test with different values of the hyperparameters, we would like to be able to record the results (e.g., the trained network) so that we dont have to repeat them again/not having to rely on our memory.

To this end, any of the following packages may help (Warning: I have never tried them):

amartinhuertas commented 1 year ago

Is it ok to commit my progress using my ANU git account (the one registered with ANU email and used for GitLab, but my ANU email is not registered for a GitHub account yet so you can't click to check its profile, since there are no profile to be checked). You can see that the current commit I've made are under the name of Xuzeng He instead of @GiteonCaulfied , this is because the git account on my laptop is the ANU one and this @GiteonCaulfied is registered with my personal email.

for the records, this is already solved ...

amartinhuertas commented 1 year ago

Tanh() are applied for all hidden layers, except for the last layer.

Looking at the outputs you plotted, the amplitude as a function of spherical harmonic identifier seems to have a piece-wise linear shape. I would also try ReLU as activation function instead of Tanh().

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas @rhyshawkins

I've been modifying the code for the model since last night, some changes are recorded as below:

Print some sample input as a plot, however, no patterns for the input are found by observing the plot
Changed some hyperparameters for the model (including epoch number, batch size and learning rate)
The activation function are all swtiched to ReLU instead of Tanh()
Implemented a early stop criterion: the model will record the current state and save it into a file when it finds a minimum validation loss so that we can test this "best" model later by loading these states (Not exactly early "stop" since the training will continue after that)
Implemented the definition of Accuracy: accuracy = 100 * correct_prediction // total_prediction (when the loss value between the prediction output and the actual scaled output from data is less than 0.02, the prediction will be considered as correct, this value of 0.02 is decided by observing some plots of comparison between the predicted output and the actual output)
After the training process, the total training loss and the total validation loss of each epoch will be plotted in 2 graphs. (These values will still be printed during the training process, plot is for better visualisation)
Implemented a test function for printing out the total loss and accuracy on testing a given dataset. This test function is later applied to four cases: Final model on training set, Final model on test set, Best model on training set, Best model on test set

The problems still exist:

After the training process, the current model will have a accuracy of 78% on the training set, but only 35% on the test set and 31% on the validation set.
If I switch to the best model found during the training process, the training accuracy will be lower, which is 36%. And the test accuracy is not improved, which is 31%
From the plots of training loss and validation loss during the training process, it can be observed the training loss achieves a plateau at the end while the least validation loss is found in epoch 56 (it later starts to increase during the training)

Not sure if it's overfitting, I've tried to solve the above issues in different ways, including:

Change batch size, learning rate and number of total epochs (increase or decrease)
Change the hidden layer structure (cut down to 2 hidden layers, increase/decrease the number of neurons in each layers, adding a dropout layer)
Switch to SGD optimizer instead of Adam optimizer (and also use the weight_decay parameter of SGD optimizer to perform L2 regularisation)
Applied a different scalar to the output data (StandardScalar) to perform data standardisation instead of using the current MinMaxScalar to perform data normalisation. Also tried both of these on the input data

Unfortunately, none of these works. Even I tried different combinations of the above techniques, In the best case I can do, the training accuracy is still around 80% for the final model while the testing accuracy is around 35% and never exceeds 40%. If I switch to the best model (the one with the lowest validation loss during training), the training accuracy will be worse and testing accuracy is not improved either.

The updated version will be pushed shortly after this comment.

amartinhuertas commented 1 year ago

I have a couple of questions for @rhyshawkins and @sghelichkhani:

Is the geoid problem well-posed (i.e., unique solution, and continuity of the solution w.r.t. to input) w.r.t. arbitrary variations of the viscosity across depth? Looking at the plots of the input that @GiteonCaulfied has generated, I see some viscosity functions which are far from being smooth, even some of them look like step-like Heaviside functions (infinite gradient).
Same question as 1. but regarding the solver. Is the solver robust no matter the viscosity function?

amartinhuertas commented 1 year ago

@GiteonCaulfied ... can you also try the following:

Gradient Descent instead of Stochastic Gradient Descent. The dataset is relatively small, and then I guess that we can afford going over the entire dataset in each epoch. This way we remove the noise related to stochasticity in the gradients.
BFGS as optimizer. I know LBFGS is the way to go for large problems, but our problem is relatively small.

amartinhuertas commented 1 year ago

Since the range of output vector is huge compared to the range of input vector and the values of the output vector seem to be evenly distributed according to a plot I've printed in the file, a Scaler called MinMaxScaler is used to scale the output data before they are split into 3 datasets. (They will be scale back to normal values when testing)

Do you have evidence that this is actually helping? I guess the answer is yes. But just to be sure.

GiteonCaulfied commented 1 year ago

Do you have evidence that this is actually helping? I guess the answer is yes. But just to be sure.

Hi @amartinhuertas

If I don't apply the scalar to the output, the loss value will be extremely huge, making it hard to optimize in the first place and the predicted result is just a straight line when reaching the plateau (I assume that this is because some values in the output vectors are extremely larger than the others, which make the loss value kind of "bias" to them). So it's better to scale them to a smaller range (here is between 0 and 1 for every value in the output vector)

Gradient Descent instead of Stochastic Gradient Descent. The dataset is relatively small, and then I guess that we can afford going over the entire dataset in each epoch. This way we remove the noise related to stochasticity in the gradients.

I am not sure if I understand this correctly. But I think Stochastic Gradient Descent means that we set the batch size to 1 (use the optimizer after we calculate the loss value of a single data's prediction, do it for every piece of data in the dataset , which is 1000 times, in a single epoch) and the Gradient Descent means that we set the batch size to the size of the training set (use the optimizer after we calculate the loss value of the complete data's prediction as a whole, do it only once in a single epoch since we've already used all 1000 vectors when calculating the loss)

If my understanding for these two terms is correct, then what I am using now is actually Mini-Batch Gradient Descent instead of these two. (set the batch size between 1 and 1000, here is 16)

For reference, here's a Stackoverflow answer that matches with my understanding: https://stackoverflow.com/questions/72496224/is-sgd-optimizer-in-pytorch-actually-does-gradient-descent-algorithm

I've tested a batch size of 1 and 1000 after reading your comment, but unfortunately both of these methods don't improve the accuracy. (Still less than 40% for the test set and nearly 80% for the training set in the best case)

I noticed that you mentioned stochasticity in the comment. To remove the possible stochasticity in sampling my batch (avoid selecting the same data more than once when forming the batch), I tried to use another way to split the complete dataset (using random_split instead of SubsetRandomSampler) and "sampling" (shuffle them first and retrieve them batch by batch normally to prevent from selecting the same vectors) from them, however, no improvement either.

BFGS as optimizer. I know LBFGS is the way to go for large problems, but our problem is relatively small.

For the BFGS optimizer, I can't seem to find one in the Pytorch library. There is one LBFGS optimizer, but that one doesn't work well as expected. Therefore, all my attempts above used Adam Optimizer

amartinhuertas commented 1 year ago

If I don't apply the scalar to the output, the loss value will be extremely huge, making it hard to optimize in the first place and the predicted result is just a straight line when reaching the plateau (I assume that this is because some values in the output vectors are extremely larger than the others, which make the loss value kind of "bias" to them). So it's better to scale them to a smaller range (here is between 0 and 1 for every value in the output vector)

Ok, so you clearly have evidence that this is helping. Good.

I am not sure if I understand this correctly. ....

Actually I did not use the proper terminology, sorry. I meant to compute the full gradient at each optimization iteration, i.e., setting the batch size to the size of the data set. I have seen they call it "full batch gradient" in the ML literature. You already did this, and it did not help, so we know that the cause of the problem does not seem to be related to reduced batch set/stochasticity. Stochasticity in the computation of the gradient is a technique which makes only sense when your batch size is less than the size of the data set.

I noticed that you mentioned stochasticity in the comment. To remove the possible stochasticity in sampling my batch (avoid selecting the same data more than once when forming the batch), I tried to use another way to split the complete dataset (using random_split instead of SubsetRandomSampler) and "sampling" (shuffle them first and retrieve them batch by batch normally to prevent from selecting the same vectors) from them, however, no improvement either.

Ok, so you tried with a batch size smaller than the data set size, but a different randome strategy to select which components of the gradient to compute at each iteration.

For the BFGS optimizer, I can't seem to find one in the Pytorch library. There is one LBFGS optimizer, but that one doesn't work well as expected. Therefore, all my attempts above used Adam Optimizer

I found this: https://github.com/rfeinman/pytorch-minimize Not sure if it can be easily combined with PyTorch, never did it before.

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas

I found this: https://github.com/rfeinman/pytorch-minimize Not sure if it can be easily combined with PyTorch, never did it before.

I tried to use this one, however, it requires a huge amount of memory for the BFGS optimizer to work and it gives me an OutOfMemoryError. I found that the amount of memory it needs is related to the structure of the neural network. Even though I cut down the structure of my network to a single hidden layer with 10 neurons, the error still exists not only for my laptop, but also for the GPU resources I manage to use on Google Colab since the memory it requires is in the unit of GiB. Therefore, I have to give up using this method.

amartinhuertas commented 1 year ago

I tried to use this one, however, it requires a huge amount of memory for the BFGS optimizer to work and it gives me an OutOfMemoryError

Ok, thanks for trying that. Let us forget about BFGS, then. The dataset of 1K entries/neural network architecture seems to be too much for this solver.

rhyshawkins commented 1 year ago

A good start, some of the NN predictions look very good (and some very bad). It may be useful to plot both the input and the predicted output versus truth for the best, worst and mean/median errors to see if there is a quality of the input that is not being captured well, e.g. do models with lots of jumps perform poorly?

I've created two new datasets for 20 thousand samples each. One is using the same prior as the initial dataset (results_20k folder) and the other is using a reduced prior such that the perturbations to the output should be far smaller (results_20k_zero folder) and I imagine the mapping from input to output would be closer to linear and therefore easier to train. Google drive link below (let me know if it doesn't work)

https://drive.google.com/drive/folders/1B_-9ukRuninMnmxkB76qBptQqocSdOAC?usp=sharing

If Xuzeng could try the results_20k_zero to see if the training becomes better that may be a good next step. 20k may still not be enough depending on the complexity of the NN. I can generate 100k if we think that would be necessary/useful.

Couple of answers to questions:

The geoid code is quite stable and reasonably well tested. Jumps in the viscosity function are perfectly ok and the derivative is not used in the forward code, it works on a set of homogeneous layers with discontinuities at layer boundaries.
It is the case that the inverse problem is not well posed. By this I mean that one input vector gives a unique output vector, but one output does not correspond to a unique input if that makes sense.

GiteonCaulfied commented 1 year ago

Hi @rhyshawkins @amartinhuertas

A good start, some of the NN predictions look very good (and some very bad). It may be useful to plot both the input and the predicted output versus truth for the best, worst and mean/median errors to see if there is a quality of the input that is not being captured well, e.g. do models with lots of jumps perform poorly?

I've further implemented the test function so that it can plot both the input and the predicted output versus truth for the best and worst errors. And I have tried to use the results_20k_zero dataset to train the model.

However, the performance of the model is still not good. For the latest model after training, the accuracy for the training set is still way higher than the accuracy for the testing set. (60% versus 29%, I could increase the number of epochs to make it higher than 60% but this is meaningless since the accuracy for the testing set is basically stuck at 29% while the validation loss is going up and up during this process)

I tried to observe some patterns from the plots of both the input and the predicted output versus truth for the best and worst errors, however, I can't figure out a potential approach for improvement since the inputs for the best case and the worst case are seemingly random during my several runs and it's not related to the number of jumps.

I also tried to make my model only contain input layer and output layer (no hidden layers), and accuracy still reached 29% but not higher. Furthermore, a 29% of accuracy for both training set and testing set also appears when I zero out 99% of the neurons in the input layer (again, no hidden layers when I did that). Even though the parameter values are different in these two cases, I guess that possibly means the 3 additional hidden layers I have in my models do not contribute to the actual learning process as expected (or you can say that it did contribute a lot to overfit the training set, but did no good for improving the accuracy of validation set and testing set).

Unfortunately, I currently don't have any ideas on how to further modify my model and solve this issue. I have tried to change the batch size, learning rate, number of hidden layers, switch to SGD optimizer and apply StandardScaler to the 20k_zero input, however, no improvement for these methods.

The updated version of code will be uploaded shortly after this comment as usual. (I am not sure whether I should add 20k_zero dataset to the repo so I will not commit these datasets in my next git commit)

amartinhuertas commented 1 year ago

Hi @GiteonCaulfied !

Thanks for the report.

Can you try to use more than 3 hidden layers?

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas

I've been trying to use a model with more than 3 hidden layers since last night. My latest model has 7 hidden layers so far and the neurons for each layers are 1028, 514, 200, 160, 120, 80, 120 (I've also tried some other hidden layer structure as well, such as 10 neurons for all 7 hidden layers or 1 neuron for the last hidden layer). Unfortunately, the accuracy is not improved and the performance is still the same as before. (overfitting for training set, no improvement for testing set and validation set)

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas @rhyshawkins

I've implemented a naive systematic testing method, including a new notebook Systematic_testing.ipynb and a text file ModelList.txt that contains a list of different hyperparameters settings for the specific result_20k_zero dataset (including hidden layers neurons, activation function, Loss function type, epoch, batch size and learning rate)

By running the Systematic_testing.ipynb, it will read each line of the ModelList.txt file to build a NN model that aligns with the settings specified in that line. After that, the plot of training loss and validation loss during training, also the plot of the Best/Worst case input and output for training set and testing set will be generated when the code is running.

One thing about this naive systematic testing method is that these plots for different models are not put together in a complete grid structure after each model has been trained and tested. It also can't be selected and viewed with a drop down box in an interactive way like the one plot in the given Julia notebook. Therefore, to compare the performance of each model, one may need to keep scrolling down and scrolling up and that can be kind of inconvenient. I'll try to think of some ways to make it better if possible.

The updated version of code will be uploaded shortly after this comment as usual.

rhyshawkins commented 1 year ago

Hi,

I've pushed the reduced data to the repository in the Data/Reduced subdirectory. There is some problem generating the values on my laptop which I believe is causing some of the issues we are seeing. I've now generated these on a linux box and the behavior is far better with a linear solution doing much better.

A small change is in the naming of the files. Sia and I talked about the geoid problem and if it is too non-linear or chaotic, we can try some of the other forward models in Sia's code. So rather than one input/output file there are now many.

In the first instance the files small_8_1k-inv.npy and small_8_1k-geoid.npy are the input and output files respectively.

There is a simple plotting script plot.py which shows the inputs and outputs as well as the linear solution.

amartinhuertas commented 1 year ago

Sia and I talked about the geoid problem and if it is too non-linear or chaotic, we can try some of the other forward models in Sia's code.

FYI ... Accordingly to Uni Houston' MSc thesis (pg 4.) ... "many previous studies have established a well-posed and unique solution for calculating geoid anomalies from density contrasts in an incompressible, spherically symmetric, layered Newtionian fluid. {REFERENCES}"

It does not explicitly refers to continuity of solution w.r.t. data, but this property is typically included in the definition of well-posedness. Looking at {REFERENCES} could should shed some light in this direction.

amartinhuertas commented 1 year ago

There is some problem generating the values on my laptop which I believe is causing some of the issues we are seeing.

Do you mean a BUG? Can we may be try to solve a problem with known analytical solution and see whether we can trust in the output of the code? Just a suggestion ...

GiteonCaulfied commented 1 year ago

Hi @rhyshawkins @amartinhuertas

I've tried to built a model with small_8_1k-inv.npy as input and small_8_1k-geoid.npy as output by using a simple structure with two hidden layers (neurons for each layer are 20 and 30), the performance is great, both training set and testing set have reached an accuracy of nearly 100%, you can check more details by taking a look at the st1D-invGeoid.ipynb(generated by systematic testing process) or 1D_Geoid.ipynb (manually changed parameters and then tested)

However, I have a problem regarding rest of the files inside the newly uploaded Reduced folder. How many of these dataset are the input data? If I understand correctly, every dataset apart from small_8_1k-geoid.npy are input dataset and all of them have small_8_1k-geoid.npy as their common corresponding output dataset, which means I need to built a different model for each of these input dataset (a total of 6 models) to find some patterns.

rhyshawkins commented 1 year ago

This is excellent. Apologies for the initial corrupted data, but now we know that there is an issue generating the data on my laptop we can build more complicated models and see if the good performance holds.

For this data set I included all the outputs from Sia's code. Sia and I were thinking that if the geoid problem was too difficult we may have had to change to use the surf output instead. Now that the geoid seems to be working, this is not necessary.

For summary, the inputs are small_8_1k-inv.npy is the input small_8_1k-rvsc.npy is the actual input to the forward code (the 8 values are projected to the full 257 input vector)

Outputs: small_8_1k-cmb/geoid/grav/surf/vel.npy

So you still only need be concerned about two files: the inv and geoid files.

I will generate suite of successive datasets for further testing: for example increasing the input vector size from 8 to 16, 32, etc all the way up to 257. Then you can test if a more complex system still works as well.

GiteonCaulfied commented 1 year ago

Hi @rhyshawkins @amartinhuertas

I've tried to built a model with zero_1k-inv.npy as input and zero_1k-geoid.npy as output (These two files are now at Data/Geoid/new_results_1k_zero folder) by using a slightly more complex structure with four hidden layers (neurons for each layer are 200, 160, 120, 80), the performance is great as well, both training set and testing set have reached an accuracy of nearly 100% (I haven't checked the accuracy using different threshold values other than 0.01 and I'll do this later, but you can see from the worst-loss-prediction that the model is great as well), you can check more details by taking a look at the st1D-1k_zero.ipynb (inside 1D_result_notebook folder) or 1D_Geoid.ipynb

In this case, do I still need to proceed on testing the 20k_zero dataset? I haven't done that yet but I do uploaded the two files in another folder located at Data/Geoid/new_results_20k_zero

GiteonCaulfied / COMP4560_stokes_ml_project

Testing of initial dataset #1