GiteonCaulfied / COMP4560_stokes_ml_project

A repository that we are going to use to keep track of project evolution, notes, ideas, etc.

Meeting Outcomes 24/08/23 #4

Open GiteonCaulfied opened 1 year ago

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas

In today's meeting we mainly discussed two things: the implementation of the current auto-encoder, and potential ways to build the neural network that predicts the temperature field at the next timestamp.

For the auto-encoder, the current one doesn't perform very well on some of the more "complex" temperature fields. Rhys therefore suggested increasing the size of the latent space, either by removing one convolutional layer or by decreasing the stride, so that the latent space can capture more features of the original temperature field.
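For reference, the standard conv output-size formula makes it easy to see how both options enlarge the latent space. A quick sketch, assuming kernel size 3, padding 1 and stride 3 in the current encoder (consistent with the 23x45 latent reported further down, but not confirmed):

```python
def conv_out(n, kernel=3, stride=3, padding=1):
    """Spatial size after one Conv2d layer: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def after_layers(h, w, n_layers, stride):
    for _ in range(n_layers):
        h, w = conv_out(h, stride=stride), conv_out(w, stride=stride)
    return h, w

print(after_layers(201, 401, 3, stride=3))  # (8, 15)   assumed current 3-layer, stride-3 setup
print(after_layers(201, 401, 2, stride=3))  # (23, 45)  one convolutional layer removed
print(after_layers(201, 401, 3, stride=2))  # (26, 51)  stride decreased to 2 instead
```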

Also, the current auto-encoder is only trained on temperature fields from one specific pair of consecutive timestamps (so only 160 inputs in the training set). After the discussion, we decided it would be better to feed in all of the temperature fields (all 10,000 of them) as training/testing/validation data. That way we no longer need to build 99 auto-encoders for 100 timestamps; a single auto-encoder can do all the work. I will test both of these changes before our next meeting.
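As a rough sketch of the data side of that change (the file name, split sizes and batch size below are placeholders, not the actual ones used in the repo):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical file: 10,000 snapshots of the 201x401 temperature field.
fields = np.load("temperature_fields.npy")           # assumed shape (10000, 201, 401)
x = torch.from_numpy(fields).float().unsqueeze(1)    # -> (10000, 1, 201, 401)

dataset = TensorDataset(x)                            # autoencoder: target = input
train_set, val_set, test_set = random_split(
    dataset, [8000, 1000, 1000],
    generator=torch.Generator().manual_seed(0))       # reproducible split

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)
test_loader = DataLoader(test_set, batch_size=32)
```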

For the neural network that predicts the next temperature field after the input field has been transformed by the auto-encoder, there are two possible approaches. One is to train multiple NNs, each working on a single pair of consecutive timestamps (one NN for 0->1, one NN for 1->2, ...); the other is to train one larger NN that handles all of them (the same NN for 0->1, 1->2, ...). We discussed this in our last meeting two weeks ago and decided on the first approach, but in today's meeting Rhys said he preferred the second. Let's leave the final decision to next Monday's meeting; in the meantime, I will focus on the implementation of the auto-encoders.
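For the second option, a minimal sketch of what a single shared latent-space stepper could look like (layer sizes and the MLP form are placeholders; the real design is still to be decided):

```python
import torch
import torch.nn as nn

class LatentStepper(nn.Module):
    """One shared network mapping the latent vector at timestamp t to t+1.

    The per-pair alternative would simply instantiate 99 of these,
    one per consecutive pair of timestamps.
    """
    def __init__(self, latent_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t):
        return self.net(z_t)

# Training pairs would be (z[t], z[t+1]) for every consecutive pair, e.g.:
# z = encoder(fields).flatten(1)                       # hypothetical ConvAE encoder
# loss = nn.functional.mse_loss(stepper(z[:-1]), z[1:])
```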

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas @rhyshawkins

I've updated the auto-encoder by removing one convolutional layer and feeding it all 10,000 temperature fields. The size of the latent space has increased to 6x23x45, which is about 13 times smaller than the original input (1x201x401). As expected, the ConvAE is now able to capture more features, and the performance is better than in the last commit (fewer spiky edges in the 10-region colour map).
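For reference, here is one two-layer encoder configuration that reproduces the reported 6x23x45 latent shape from a 1x201x401 input; the kernel size 3, stride 3, padding 1 and the 16 hidden channels are assumptions, not necessarily the layers used in ConvAE_training.py:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 6, kernel_size=3, stride=3, padding=1),
)

z = encoder(torch.zeros(1, 1, 201, 401))
print(z.shape)                              # torch.Size([1, 6, 23, 45])
print((1 * 201 * 401) / (6 * 23 * 45))      # ~12.98, i.e. the ~13x compression
```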

One problem is that training now takes about 2-3 hours since I use all 10,000 temperature fields for training/testing/validation (I stopped the training manually after 180 epochs because it was taking too long), so I may have to use Gadi if we want to test with a larger dataset.

sghelichkhani commented 1 year ago

@GiteonCaulfied This is indeed what I was hoping would happen. It would be nice to think already about how to deal with larger datasets. I reckon the larger the dataset, the longer the predictability limit of the system would be.

amartinhuertas commented 1 year ago

Hi @GiteonCaulfied ! I have sent you an invitation to my kr97 project at Gadi. You should have received an email message with the invitation; let me know if this is not the case. The project has a 50 KSU budget until the last day of September.

Misc info and links to get you started with Gadi can be found here: https://github.com/gridap/GridapDistributed.jl/wiki/Gadi-(NCI)-Useful-links,-commands,-and-workflows

amartinhuertas commented 1 year ago

I would guess that you would have to run the code in the gpuvolta partition (NVIDIA V100 GPUs)

https://opus.nci.org.au/display/Help/Queue+Structure#QueueStructure-Overview

In regards to where to upload the massive dataset: the home directory only has a 10 GB quota, so you should store it in the scratch filesystem, which has a much larger capacity.

https://opus.nci.org.au/display/Help/Gadi+Quick+Reference+Guide
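For reference, a skeleton of a gpuvolta job script along these lines; the project flags match kr97, but the resource requests and module versions are placeholders to adapt, and your actual script may well look different:

```bash
#!/bin/bash
# Hypothetical Gadi submission script; adjust resources to the real job.
#PBS -P kr97
#PBS -q gpuvolta
#PBS -l ngpus=1
#PBS -l ncpus=12          # gpuvolta bundles 12 CPU cores with each GPU
#PBS -l mem=48GB
#PBS -l walltime=01:00:00
#PBS -l storage=scratch/kr97   # needed so the job can see /scratch/kr97
#PBS -l wd

module load python3/3.9.2      # or however PyTorch is made available (e.g. a venv on /scratch)
python3 ConvAE_training.py
```

Submit it from a Gadi login node with `qsub <script>.sh` and check its status with `qstat`.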

amartinhuertas commented 1 year ago

This also might be useful (from one of the courses I teach):

https://gitlab.cecs.anu.edu.au/comp4300/2023/comp4300-lab1

To understand SU-consumption for the jobs that you submit:

https://opus.nci.org.au/display/Help/2.2+Job+Cost+Examples

To check current SUs availability, execute the following command from a Gadi terminal:

nci_account -P kr97 -v

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas

I've now successfully trained the ConvAE using Gadi and cut the running time down from 2-3 hours to 16 minutes. The training file is ConvAE_training.py and the job script submitted to Gadi is ConvAE_training_job.sh. The training script outputs the training loss and validation loss as a text file called ConvAE_trainingData_Gadi.txt, along with the parameters of the best encoder and decoder in the files Conv2D_encoder_best_Gadi.pth and Conv2D_decoder_best_Gadi.pth.
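The save/log pattern is roughly the following (a simplified sketch with stand-in layers and made-up loss values; the real training loop lives in ConvAE_training.py):

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(1, 6, kernel_size=3, stride=3, padding=1)           # stand-in
decoder = nn.ConvTranspose2d(6, 1, kernel_size=3, stride=3, padding=1)  # stand-in

history = [(0.9, 1.0), (0.5, 0.6), (0.4, 0.7)]   # fake (train_loss, val_loss) per epoch
best_val = float("inf")
for train_loss, val_loss in history:
    if val_loss < best_val:                       # keep only the best weights
        best_val = val_loss
        torch.save(encoder.state_dict(), "Conv2D_encoder_best_Gadi.pth")
        torch.save(decoder.state_dict(), "Conv2D_decoder_best_Gadi.pth")

with open("ConvAE_trainingData_Gadi.txt", "w") as f:
    for train_loss, val_loss in history:
        f.write(f"{train_loss} {val_loss}\n")
```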

After training, I manually download these files from the remote server and put them in a folder called 2D_ConvAE_results. A Jupyter notebook called ConvAE_visualisation.ipynb is then used to visualise the results, much like what I did for the 1D problem. The performance is the same as in the last commit.
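The loss-curve part of that visualisation is roughly the following (assuming the text file holds one "train_loss val_loss" pair per line; the real notebook format may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

losses = np.loadtxt("2D_ConvAE_results/ConvAE_trainingData_Gadi.txt")
plt.plot(losses[:, 0], label="training loss")
plt.plot(losses[:, 1], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```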

Also, I've renamed some of the files from the 1D problem to distinguish them from the files in the 2D problem.

The updated files will be pushed shortly after this comment.