GiteonCaulfied / COMP4560_stokes_ml_project

A repository that we are going to use to keep track of project evolution, notes, ideas, etc.

Testing with larger data set #7

Open GiteonCaulfied opened 1 year ago

GiteonCaulfied commented 1 year ago

Hi,

I've been testing the new data set by using interpolation to generate new temperature fields with a standard time step for each file. However, I can't even generate the new temperature fields for a single file within 5 hours. There are probably some issues with the way I interpolate the data; I'll see if I can fix this.

I've also applied the new data set to the ConvAE without interpolation, and the result is as good as expected. To reduce the computation time for testing on my local laptop, I've now moved the testing code to a separate script run on Gadi. In this case, I can simply download the testing data from Gadi to visualize it and don't have to load the 40GB data set on my local laptop before testing and visualisation. (For now this change only applies to the ConvAE; I will implement it for the LSTM as well when I've finished training.)

In the meantime, I'll test the data set on the LSTM without using interpolation to see if the accuracy of the current predictions increases with this larger data set.

sghelichkhani commented 1 year ago

I've been testing the new data set by using interpolation to generate new temperature fields with a standard time step for each file. However, I can't even generate the new temperature fields for a single file within 5 hours. There are probably some issues with the way I interpolate the data; I'll see if I can fix this.

How are you doing the interpolation? For this problem, what you don't want to do is pass everything into a single scipy interpolation call, as it tries to load everything into memory. What I usually do in these cases is set up a class that builds a tree and knows the two nearest neighbours for every time that it is queried. This would be a typical skeleton for that (note that using cKDTree is probably overkill for this problem, but nevertheless):

import h5py
import numpy as np
from scipy.spatial import cKDTree

class HDF5Interpolator:
    def __init__(self, filename):
        # Load the HDF5 file
        with h5py.File(filename, 'r') as file:
            self.timestamps = file['timestamps'][:]  # Assuming the timestamps are stored as a dataset
            self.arrays = [file[f'array_{i}'][:] for i in range(len(self.timestamps))]  # Replace with actual path to your arrays

        # Build the KDTree
        self.kdtree = cKDTree(self.timestamps.reshape(-1, 1))  # Reshape to meet KDTree input requirement

    def interpolate(self, time):
        # Find the two nearest timestamps (query orders them by distance, not by value)
        distances, indices = self.kdtree.query(np.array([[time]]), k=2)
        i1, i2 = sorted(indices[0], key=lambda i: self.timestamps[i])
        t1, t2 = self.timestamps[i1], self.timestamps[i2]

        # Ensure the queried time lies between the two bracketing timestamps
        if t1 > time or t2 < time:
            raise ValueError(f"Time {time} is out of bounds of the data timestamps.")

        # Get the corresponding arrays
        array1, array2 = self.arrays[i1], self.arrays[i2]

        # Linearly interpolate between the two arrays
        alpha = (time - t1) / (t2 - t1)
        interpolated_array = array1 * (1 - alpha) + array2 * alpha

        return interpolated_array

# Usage:
interpolator = HDF5Interpolator('your_file.hdf5')
interpolated_array_at_t = interpolator.interpolate(5.5)  # Assume we want to interpolate at time = 5.5

amartinhuertas commented 1 year ago

I've now moved the testing code to a separate script run on Gadi.

Please do not hesitate to burn computing time on Gadi. I am indeed in a rush, as the current quarter ends Saturday night, and I have to burn all the computing time devoted to that quarter by then.

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas,

I've now successfully generated the interpolated temperature fields for all the files and started to train the ConvAE using this interpolated data set.

However, I am actually thinking about the usage of this interpolated data set. The previous problem of the FNN-predicted GIF moving too fast/slow was caused by the varying interval between pairs of time stamps (e.g. T2 - T1 = 1 but T3 - T2 = 1.6) when we trained the model.

By using interpolation, we can generate a new set of temperature fields with the same interval between each pair of time stamps (e.g. T2 - T1 = T3 - T2 = ... = 1). Training on the interpolated data set avoids this interference, and the testing results on the interpolated data set should be good if nothing goes wrong.
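
(For reference, a minimal sketch of how one file could be resampled onto a uniform time grid, reusing the HDF5Interpolator skeleton from earlier in this thread. The file name, the number of steps and the timestamp handling here are assumptions for illustration, not the actual resampling script.)

import numpy as np

# Hypothetical resampling of one simulation onto a uniform time grid
interpolator = HDF5Interpolator('solution_0.hdf5')  # assumed file name
t_min, t_max = interpolator.timestamps.min(), interpolator.timestamps.max()
uniform_times = np.linspace(t_min, t_max, 100)  # e.g. 100 evenly spaced time steps

# Stack the interpolated fields so that T2 - T1 = T3 - T2 = ... holds by construction
resampled_fields = np.stack([interpolator.interpolate(t) for t in uniform_times])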

However, when testing with the original data set, we lose this advantage: its time intervals are not the same, while the FNN we trained assumes they are. In that case, the problem of the predicted GIF moving too fast/slow should occur again.

Overall, I am a bit confused about the usage of the interpolated data set right now, any insight would be really helpful.

sghelichkhani commented 1 year ago

@GiteonCaulfied I completely agree with all that you said. But I am a bit confused why you think this is a problem. Didn't we want to redo the same exercise as with the previous dataset, but this time feed the NN with the interpolated dataset, so that we would not have the issue with time-steps anymore, plus now we have way more training sets?

GiteonCaulfied commented 1 year ago

Hi @sghelichkhani ,

I think I get it now. We are actually testing the performance of different NN architectures on a data set that no longer has the time-step issue (that is, the interpolated data set generated from the original 40GB data set).

In this case, there is no need to test our model on the original 40GB data set, which has the time-step issue, since we already know what will happen (the predicted GIF moving too fast/slow) from testing on the 2GB data set earlier.

sghelichkhani commented 1 year ago

@GiteonCaulfied I think what you are saying makes absolute sense! We could actually make it into three sections for your final report:

1. Testing a limited data set that has the issue of varying time-steps (already done).
2. Testing a larger data set that is now interpolated, so it does not have the issue of varying time-steps (going to do).
3. Reducing the size of the training set in 2 to make it comparable with 1, but with interpolated fields, so the time-step is constant. This way we can analyse the effect of data-set size versus time-step changing. (Do this only if you think you will have the time.)

Obviously, this depends on what you yourself, @amartinhuertas and @rhyshawkins think is best.

amartinhuertas commented 1 year ago

We could also try a 4th option: a larger data set with varying time-steps, just to confirm whether the issue is the varying time step or the scarcity of data.

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas,

I've finished testing the interpolated larger dataset and rearranged some files and directories. The animations for the LSTM and FNN can be seen in the directory 2D_GIFs. For further testing results, check the 3 visualization notebooks for the ConvAE, FNN and LSTM.

Overall, the problem of the predicted GIF moving too fast/slow is now gone, as expected. However, there is no significant improvement in the animations for either the FNN or the LSTM.

We could also try a 4th option: a larger data set with varying time-steps, just to confirm whether the issue is the varying time step or the scarcity of data.

I'll try this option as my next step, since I've already trained a ConvAE using the original 40GB data set.

Since the previous results for the limited dataset can no longer be seen in the visualization notebooks, I added the last part of these results (LSTM) to my draft report yesterday so that you can check them for comparison. I will commit the updated report shortly after this comment.

amartinhuertas commented 1 year ago

Hi @GiteonCaulfied,

here is something that I believe is worth exploring.

Can we try to determine experimentally for how many time steps we can use the trained FNN without losing track of the transient dynamics?

The idea would be the following: use the FNN for a set of S consecutive time steps, with, e.g., S=2, S=4, or S=8, and then "correct" the time series with the truth coming from the simulator.

If this idea is sensible and improves the current outcomes, one could also try to answer (although maybe not in this project): what happens if we "correct" the time series with temperature fields from a coarser time- and space-resolution simulation, projected onto the fine-grid time and space discretization?
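
(A minimal sketch of that rollout-with-correction loop, under the assumption that fnn is a trained model mapping one temperature field to the field at the next time step and true_fields is the ground-truth series from the simulator; both names are hypothetical.)

import numpy as np

def rollout_with_correction(fnn, true_fields, S):
    # Start from the true initial field
    predicted = [true_fields[0]]
    current = true_fields[0]
    i = 1
    n = len(true_fields)
    while i < n:
        # Use the FNN for S consecutive time steps, feeding its output back as input
        for _ in range(S):
            if i >= n:
                break
            current = fnn(current)
            predicted.append(current)
            i += 1
        # Then "correct" the time series with the truth coming from the simulator
        if i < n:
            current = true_fields[i]
            predicted.append(current)
            i += 1
    return np.stack(predicted)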

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas,

Can we try to determine experimentally for how many time steps we can use the trained FNN without losing track of the transient dynamics?

The idea would be the following: use the FNN for a set of S consecutive time steps, with, e.g., S=2, S=4, or S=8, and then "correct" the time series with the truth coming from the simulator.

I totally agree that this is worth exploring, and I plan to test it on the FNN model trained using the interpolated dataset (which does not have the issue of the predicted GIF moving too fast/slow). I can use the total loss over the entire test set to determine the best number of time steps for which we can use the trained FNN in an output-fed-as-input loop.

However, I am a bit confused about how we can further visualize these results using the best/worst-case animation GIFs. The current best/worst case is determined by the total loss value of a simulation when the trained FNN is used in an output-fed-as-input loop. If we want to test with a set of different numbers of consecutive time steps, e.g. S=2, S=4, or S=8, which S should we use to choose the best/worst case (e.g. the best case for S=8, together with the results for that same simulation at S=2, S=4, ...)? Or should we determine and visualize the best/worst case separately for each S instead of trying to visualize them all together (e.g. the best case for S=8, the best case for S=2, ...)? Or should we only visualize the best/worst case of the best S (e.g. the best case for S=8, supposing S=8 is the best choice)?

Also, the free storage of Git LFS is approaching its limit, since I use Gadi to generate test data which I then download to my repo and visualize. I therefore plan to delete some of the previously trained NN models (or all of them, to spare more space for the test data) and move them into a shared Google Drive folder which I can refer to in my repo or report.

amartinhuertas commented 1 year ago

S is a parameter of the approach.

The result (i.e., full time series) you are going to get with S=2 is different from the one corresponding to S=4, or S=8. (I think you got that, but just in case).

I would take the PCA of the simulation as a reference to compare against. The closer you are to the PCA of the simulation, the better. We can take, e.g., the Euclidean norm of the difference divided by the Euclidean norm of the PCA of the simulation.

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas ,

I would take the PCA of the simulation as a reference to compare against. The closer you are to the PCA of the simulation, the better. We can take, e.g., the Euclidean norm of the difference divided by the Euclidean norm of the PCA of the simulation.

I've tested a set of different numbers of consecutive time steps, S = 2, 4, 8 or 16, and taken the Euclidean norm of the sequence difference divided by the Euclidean norm of the PCA difference of the simulation as a reference to compare them.

The results are shown below:

Screenshot 2023-09-29 at 5 57 44 pm

S=2 seems to be the best of them, but that doesn't give much useful information, since the pattern just means that the smaller S is, the more accurate the predicted sequence is. Maybe what we want to find here is an S that is as large as possible while its loss-divided-by-PCA is not too far from the single-prediction method (S=1)?

If that is the case, I'll add more values of S to test with and also use S=1 as our baseline.
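
(One possible way to formalise that selection, as a sketch: the names and the 10% tolerance are hypothetical, just to illustrate the rule of taking the largest S whose error stays close to the S=1 baseline.)

def largest_acceptable_S(S_values, errors, baseline_error, tolerance=0.1):
    # Keep only the S whose loss-divided-by-PCA is within the tolerance of the S=1 baseline
    acceptable = [S for S, e in zip(S_values, errors)
                  if e <= baseline_error * (1 + tolerance)]
    # Return the largest acceptable S, or None if none qualifies
    return max(acceptable) if acceptable else None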

amartinhuertas commented 1 year ago

Ok, these results make sense. The larger the S, the worse the prediction in PCA terms. Are you measuring this for a particular instance of the initial conditions, or for the whole data set?

In any case, I am not sure I understand how you are computing "the Euclidean norm of the sequence difference divided by the Euclidean norm of the PCA difference of the simulation". Can you show me the code you are using? (Not saying it is wrong, I just want to double check that we are on the same page.)

Can you try S=number of total time steps? (i.e., what we were doing till now).

Maybe what we want to find here is an S that is as large as possible while its loss-divided-by-PCA is not too far from the single-prediction method (S=1)?

Yes. This is what we want: how large can S be without significantly affecting accuracy, so to speak.

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas,

Ok, these results make sense. The larger the S, the worse the prediction in PCA terms. Are you measuring this for a particular instance of the initial conditions, or for the whole data set?

I am measuring the whole data set (calculating the PCA difference for each file and adding them up).

In any case, I am not sure I understand how you are computing "the Euclidean norm of the sequence difference divided by the Euclidean norm of the PCA difference of the simulation". Can you show me the code you are using? (Not saying it is wrong, I just want to double check that we are on the same page.)

I just found that I made a small mistake when calculating the data loss; here's the updated version of the code after I fixed it.

Screenshot 2023-09-30 at 11 07 35 am

For your reference, I am using the l2 norm (np.linalg.norm) to calculate the differences of the PCAs and of the data, where S_predicted_j.diagonal() holds the eigenvalues of the predicted temperature field series for a given S (e.g. S=4) and S_original.diagonal() holds the eigenvalues of the original temperature field series stored in a file. Also, predicted_temperature_fields_list[j] is the predicted temperature field series (100x201x401) and testing_temperature_fields is the original temperature field series stored in a file (100x201x401).
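
(Since the actual code is only in the screenshot above, here is a sketch of what the described computation might look like, written as a hypothetical helper around the names mentioned; it is an interpretation, not the code itself.)

import numpy as np

def pca_and_data_errors(S_predicted, S_original, predicted_fields, testing_fields):
    # l2 norm of the difference of the PCA eigenvalue spectra
    pca_diff = np.linalg.norm(S_predicted.diagonal() - S_original.diagonal())
    # l2 norm of the difference of the temperature field series (e.g. 100x201x401 arrays)
    data_diff = np.linalg.norm(predicted_fields - testing_fields)
    # Reference measure: sequence difference divided by the PCA difference
    return data_diff / pca_diff, pca_diff, data_diff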

Can you try S=number of total time steps? (i.e., what we were doing till now).

Here's the updated version of the results, with S = 0 and S = 99 included. The results make more sense now after I fixed the mistake.

Screenshot 2023-09-30 at 11 54 56 am

The repo will be updated shortly after this comment.

amartinhuertas commented 1 year ago

I am measuring the whole data set (calculating the PCA difference for each file and adding them up).

Ok, I would also calculate minimum, maximum, average, and std dev of the errors.

I just found that I made a small mistake when calculating the data loss; here's the updated version of the code after I fixed it.

Ok, no worries. I would divide norm2() of the difference by norm2 of S_original, so that you have relative errors.
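
(A short sketch of both suggestions combined, assuming hypothetical per-file lists of predicted and original PCA matrices; it only illustrates the relative errors and summary statistics, not the actual test script.)

import numpy as np

def relative_pca_errors(predicted_S_list, original_S_list):
    # Relative error per test file: norm of the eigenvalue difference divided by norm of S_original
    errors = np.array([
        np.linalg.norm(S_pred.diagonal() - S_orig.diagonal()) / np.linalg.norm(S_orig.diagonal())
        for S_pred, S_orig in zip(predicted_S_list, original_S_list)
    ])
    # Minimum, maximum, average and std dev of the errors over the whole test set
    return errors.min(), errors.max(), errors.mean(), errors.std()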

GiteonCaulfied commented 1 year ago

Hi @amartinhuertas,

Ok, I would also calculate minimum, maximum, average, and std dev of the errors.

Ok, no worries. I would divide norm2() of the difference by norm2 of S_original, so that you have relative errors.

I've uploaded an updated version of the testing results for the FNN trained with the interpolated data set, including the minimum, maximum, average and std dev of the errors and also the relative PCA differences. You can find the newest results at the following path: 2D_FNN_results/larger_dataset(interpolated)/FNN_testingData_Gadi_3.txt

We could also try a 4th option: a larger data set with varying time-steps, just to confirm whether the issue is the varying time step or the scarcity of data.

I also finished training and testing with the 4th option you mentioned, which confirmed that the issue is the varying time step.

RichardScottOZ commented 4 months ago

Datasets appear to be inside-ANU, is this correct?

GiteonCaulfied commented 4 months ago

Hi @RichardScottOZ ,

Thank you for your interest in this project!

Datasets appear to be inside-ANU, is this correct?

I have checked those links to the Mantle Convection datasets, and currently one can only access them with an ANU account. I'll see if I can find an alternative way to access these datasets by uploading them somewhere else, such as Google Drive (if permitted by my supervisors).

In the meantime, you can check out the data generator for these datasets located in Data/2d_stokes_solver/base.py

GiteonCaulfied commented 4 months ago

@RichardScottOZ Also, may I ask what your interest in this project is?

RichardScottOZ commented 4 months ago

This might give you an idea: https://github.com/RichardScottOZ/mineral-exploration-machine-learning

Basically, professional interest when someone tries something relevant to this domain.

A better answer now that I'm not in the middle of something, sorry.

RichardScottOZ commented 4 months ago

In the meantime, you can check out the data generator for these datasets located in Data/2d_stokes_solver/base.py

Thanks, will take a look.

sghelichkhani commented 4 months ago

Hi @RichardScottOZ ! Many thanks for your interest. Could you check if the new link works: https://anu365-my.sharepoint.com/:u:/g/personal/u1093778_anu_edu_au/EXSV10A0DtpDodSZwbpm06wBK6dFLS-MUfbVCc9PIE4t6g?e=o4tD7q

If you still have issues, please send me an email via siavash.ghelichkhan@anu.edu.au

RichardScottOZ commented 4 months ago

I get to a tar archive by the looks of it... Will try the download when back at a desk, thanks!

RichardScottOZ commented 4 months ago

Looks like it is downloading OK, will let you know when I have it, thank you!

RichardScottOZ commented 4 months ago

A significant number of HDF5 files by the looks of it. I did get an end-of-archive error, so there may have been download issues. Would you be able to tell me how many files there should be, please?

sghelichkhani commented 4 months ago

There are 100 HDF5 files in that tar file. As mentioned by Alberto, the files are generated by randomly initialising a temperature field using specific choices of domain and temperature field, and then convecting them forward using Stokes to get the temperature distribution.

I guess if there were a less cryptic way of saying what your aim is, we could be more helpful about this.

RichardScottOZ commented 4 months ago

Sorry, don't mean to be cryptic.

See LinkedIn if you want more random stuff, but the TL;DR is copper mines and how to find them.

From however deep. So when I saw your project looking at that and doing some trials with neural networks, I thought: that looks interesting, I should investigate further.

The Mantle Convection part; I don't care so much about geoids.

If you want to get more of an idea of things along those lines :- https://www.earthbyte.org/stellar/

RichardScottOZ commented 4 months ago

And it seems I will need to redownload it.

RichardScottOZ commented 4 months ago

Looks good now. Just a general comment: the autoencoder and LSTM use also piqued my interest.

RichardScottOZ commented 4 months ago

image

RichardScottOZ commented 4 months ago

picking solution_0 as an example