Summer 2023: Goals and Plans for the Summer

davidvenuto commented 1 year ago

This issue will serve as an overview of my goals for this summer, broken down by each week

davidvenuto commented 1 year ago

Goals for 6/18-6/25

[x] Sift through Bernard's code
- Particularly look at his code on the MEOFs and identify the dominant modes. These will give context about the predictor variables in the ML model.
- Find the daily version of Bernard's data and begin figuring out how to draw these important variables from it.
- Look through the rest of Bernard's relevant code.
[x] Start semi-development of time series model (Too early, not enough understanding yet)
- [x] Possibly perform independent MEOF analysis on new data
- [x] Decide on the proper way to implement model (RNN, CNN, LSTM (most likely LSTM according to Dhruv))
  - LSTM appears to be the way to go. Its ability to capture long-term patterns and to handle non-linearity fits the goals of our project. May be computationally expensive given the size of the data.
  - Other options that were considered: Vector Autoregression, Gated Recurrent Unit, Transformer Model, Gaussian Processes
    - Vector Autoregression: Linear model, so not well suited to the non-linear behavior we want to capture
    - Gated Recurrent Unit: Simpler gating than an LSTM; may capture long-term dependencies less well on some tasks
    - Transformer Model: Computationally expensive and less interpretable than other models
    - Gaussian Processes: Limited scalability to high-dimensional data, and requires assumptions about the underlying distribution and kernel functions
- [x] Strengthen understanding of LSTM model
  - LSTM is a type of recurrent neural network that uses previous inputs to predict the next output (hence recurrent)
  - Training an LSTM requires input-output pairs. These are commonly built with the sliding window technique.
    - Example: Let's say we have the numbers 1-10, and the window size is 2. The first input would be [1,2], and the corresponding output would be 3, making the input-output pair ([1,2],3). The next input would be [2,3] and the corresponding output would be 4, making the second input-output pair ([2,3],4). This pattern would continue until the final pair, ([8,9],10).
    - After creating our input-output pairs, we need to convert our training and testing data into numpy arrays and then reshape them to actually use in the LSTM.
    - Then we create our model using Keras, and train this model on the training data.
    - We can validate our predictions on the test data.
    - Finally, run the model on independent data to assess how accurate its predictions are.
- [ ] Lay out framework of what is necessary to implement this specific model in a separate git-issue
[x] Go back to Chen-Yuan (2004) and summarize understandings, and how these ideas will transfer over to your project
- Essentially, Chen-Yuan describes how to identify the dominant modes (patterns of spatial co-variability) among a number of variables. From the leading modes, we can identify which variables are the most "important," and from there we can begin creating our own model using ML.
- Jianna's job will be to come up with a better way of utilizing dimensionality reduction, which will possibly change the modes and thus the variables. The ideal situation is one in which Jianna and I are able to combine our projects into one.
[ ] Potentially integrate over to orca computer, maybe not if LEAP continues serving purposes for now (Not necessary yet)
[x] Create Powerpoint for LDEO Research Focus Session
- [x] Edit and provide visuals
- [x] Update throughout the week if necessary
[x] Come back at end of week and review how each of the goals was met, and any complications
[x] Write the goals for next week
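
The sliding window example in the LSTM notes above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing: the helper name `make_windows` and the 75/25 chronological split are assumptions made for the example. The reshape to `(samples, timesteps, features)` is the 3-D input layout that Keras recurrent layers expect.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (input window, next value) pairs."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])   # e.g. [1, 2]
        y.append(series[start + window])          # e.g. 3
    return np.array(X), np.array(y)

# The numbers 1-10 with a window of 2, as in the example above
series = np.arange(1, 11)
X, y = make_windows(series, window=2)

# Recurrent layers take 3-D input: (samples, timesteps, features).
# Here there is one feature per time step.
X_lstm = X.reshape((X.shape[0], X.shape[1], 1))

# Assumed split: keep time order (no shuffling) so that the test
# pairs come strictly after the training pairs.
split = int(0.75 * len(X_lstm))
X_train, X_test = X_lstm[:split], X_lstm[split:]
y_train, y_test = y[:split], y[split:]

print(X[0], y[0])
print(X_lstm.shape)
```

With 10 values and a window of 2 this yields 8 pairs, from ([1,2],3) through ([8,9],10). Splitting chronologically rather than randomly is a design choice that avoids training on values that come after the test period.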

Update for end of week

This week went relatively well. I think the main gap that may come back to bite me if I don't fortify it is my understanding of the previous work. I have a basic understanding of MEOFs, the Markov model, etc., but not enough to explain them, as was highlighted in my research focus session. I think the next research focus session and our meeting on Monday may be a bit of a wake-up call, as they will force me to 1. really start knowing what I need to be working on, and get working on it, 2. actually understand the basis for my project, and 3. start the implementation of my beginner model.
The fact that we are almost halfway through the summer is a bit mind boggling to me, as I feel like I haven't made much progress. I think I need to start being more direct in going to Dhruv and Xiaojun with questions and to really make what I want to happen, actually happen.
A lot of the soft goals for this week were met, but development has yet to start. Once again, I think this is due to a relatively shallow understanding of the previous material, but I know I can turn that around next week.

davidvenuto commented 1 year ago

June 20th

It appears that the dominant modes from Bernard's code imply that the important variables are SIC, SST, and SAT.
Planning to talk to Xiaojun as to whether my deductions are correct and where to go from here.
- Talk about whether to use original or anomaly data
- If anomaly, smoothed or not? Using the anomaly data would only help us maybe capture future anomalies, but at that point, why not just use the original data?
- If original data, discuss the process of how we plan to extend the given timeframe (42 years is too short, she mentioned artificially creating data from a monthly rolling mean, will need clarification)
- Another part of extending the timeframe is the CMIP6 data, which is apparently available locally. Get clarification on this as well: are we not using the same data that Bernard used?
- Dhruv mentioned the other day that an LSTM would likely be the best model. Confirm this with Xiaojun and get her thoughts on it.
- I will likely go back to Bernard's code/Powerpoint Poster and see if I can gain any new insight after conversation with Xiaojun
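
To make the original-versus-anomaly question above concrete, here is a minimal sketch of how the two relate. It assumes the common convention that an anomaly is the value minus the long-term mean for that calendar month, and it uses a synthetic 42-year monthly series rather than the real sea ice data; the 3-month smoothing window is also just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years = 42                                   # record length mentioned above
months = np.tile(np.arange(12), n_years)       # calendar month index, 0-11

# Synthetic stand-in for a monthly series: seasonal cycle plus noise
series = 10 + 5 * np.cos(2 * np.pi * months / 12) + rng.standard_normal(months.size)

# Climatology: the mean of each calendar month across all years
climatology = series.reshape(n_years, 12).mean(axis=0)

# Anomaly: departure from that month's climatological mean
anomaly = series - climatology[months]

# Smoothed anomaly: simple 3-month running mean (assumed window)
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(anomaly, kernel, mode="valid")

# The original series is recovered by adding the climatology back
reconstructed = anomaly + climatology[months]
print(np.allclose(reconstructed, series))
```

Because the unsmoothed anomaly and the original series differ only by a fixed seasonal cycle, a prediction of one can be converted into the other. Smoothing, by contrast, discards information and shortens the series by `window - 1` points.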

Meeting went well, see next comment for notes

davidvenuto commented 1 year ago

Summary of Meeting with Xiaojun (6/20)

First, we discussed the specifics of the Markov Model process and what it means for my project.
- Essentially, a number of variables are input into the MEOFs. From here, the dominant "modes" are found. In this context, modes refer to the spatial distribution or co-variability amongst the variables. Knowing the dominant modes tells me what variables are the most important, and in turn I can use these variables to create my own ML model.
- From here, we began discussing where I should start for my research focus session for this Friday (6/23).
- Bernard used monthly sea ice concentration data for his project, but this same data is too short for my ML purposes. I am going to need to find the corresponding daily data, which will give me enough data to at least make a beginner ML model.
- Prior to any model making, Xiaojun suggested that I perform my own MEOF analysis on the daily data that I obtain to find the dominant modes and thus the variables I want to use for my model. This may or may not be worth it. After all, if the purpose is to combine with Jianna's project, and I already have Bernard's data as a reference, perhaps he has already found the same things that I would. But I am using daily as opposed to monthly data, so I will likely perform the MEOF to confirm.
- Xiaojun also suggested that I get more informed on the suggested LSTM model, but also research any other potential models and discuss them with her and Dhruv. Off the bat, the LSTM makes sense, but I have not done enough research to conclude whether or not it is definitely the right one. However, it may be the best one to at least start with.
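
As a rough illustration of the MEOF step described above, the sketch below standardizes several variables, stacks them side by side, and extracts the dominant modes with a singular value decomposition. This is not Bernard's code: the data are random stand-ins, the grid and record sizes are hypothetical, and scaling each field by its overall standard deviation is one assumed normalization among several possible ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_grid = 120, 50        # hypothetical number of time steps and grid points

# Random stand-ins for three fields, each shaped (time, grid points)
fields = {name: rng.standard_normal((n_time, n_grid))
          for name in ("SIC", "SST", "SAT")}

def standardize(field):
    """Remove the time mean at each grid point, then scale by the
    field's overall standard deviation so that variables with
    different units carry comparable weight."""
    anomalies = field - field.mean(axis=0)
    return anomalies / anomalies.std()

# Multivariate step: concatenate the variables along the spatial axis
combined = np.hstack([standardize(f) for f in fields.values()])  # (time, 3 * grid)

# SVD: rows of Vt are the spatial modes, U * s are the PC time series
U, s, Vt = np.linalg.svd(combined, full_matrices=False)
modes = Vt
pcs = U * s
explained = s**2 / np.sum(s**2)   # fraction of variance per mode

# How much of the leading mode's pattern sits in each variable
leading = modes[0].reshape(len(fields), n_grid)
weight = (leading**2).sum(axis=1)

for name, w in zip(fields, weight):
    print(f"{name}: {w:.2f} of leading-mode loading")
print("variance explained by first 3 modes:", explained[:3].round(3))
```

The per-variable `weight` is one simple way to read off which variables dominate a mode. With real data, the leading modes would explain far more variance than they do for this random input.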

davidvenuto commented 1 year ago

General Roadmap Summer 2023

[x] Week1 (6/5-6/11): Bootcamp Wk1
[x] Week2 (6/12-6/18): Bootcamp Wk2
[x] Week3 (6/19-6/25): 1st research session. Begin delving into actual project and understanding what was done last summer
[x] Week4 (6/26-7/2): Continued understanding of previous work, 2nd research session, and layout of what a model implementation would look like.
[x] Week5 (7/3-7/9): Creation of first model
[x] Week6 (7/10-7/16): Improvements on first model
[ ] Week7 (7/17-7/23): Merge with Jianna's work
[x] Week8 (7/24-7/30): Make any final changes and create poster for final presentation
[ ] Week9 (7/31-8/6): Present poster, finish final paper

davidvenuto commented 1 year ago

Goals for 6/25-7/2

[x] Truly understand the work from Chen-Yuan
- [x] What is an EOF?
  - [x] What is the process?
  - [x] Why do we use it?
  - [x] How is this then expanded to MEOFs?
- [x] What is hindcasting?
- [x] What is cross validation?
- [x] What is the basic process behind the Markov model?
- [x] How are all of these concepts found in Bernard's code?
- [x] Understand the crossover from Chen-Yuan to my project
- [x] How will MEOFs be applied to the data I'm using?
- [x] What is the best ML model to be using?
  - [x] Assuming it is LSTM, why? Look for resources on this so you can back up this claim, not just ChatGPT
    - [x] What is the basic process of an LSTM?
  - [ ] Find other options to consider besides LSTM.
- [x] Write Research Session 2 Powerpoint
- [x] Lay out complete, detailed roadmap for entire beginning ML implementation process, and START IT

davidvenuto commented 1 year ago

June 30th, Beginning of First NN on Sea Ice Data

I have started to develop a simple feedforward NN to use on the SI data.
The data is being taken from the PCs of the MEOF analysis applied by Bernard.
The model has been created, but it seems to be having some problems:
- There is a somewhat large gap between the training and validation loss.
- Validation loss initially goes down but curves up about 3/4 of the way through training.
- This varies with parameters. Initially, validation loss shot right up and oscillated.
- Things I've tried:
  - Adjusting # of neurons, learning rate, batch size, # of layers, types of layers added, and types of activation functions used
  - Adding regularization, such as an early stopping callback (did help)
  - Changing optimizers from Adam (tried SGD, made it much worse)
  - Changing output layer activation function (best result has been when the output layer is linear)
- Things I still want to try:
  - I think the dataset is too small, as there are only roughly 514 points (I think?). At the meeting today, I will bring up how the data may need to be expanded artificially. Not sure what this entails as of right now.
  - Talk to Dhruv and ask him what he thinks about the model (is it overfitting, underfitting, are the parameters bad?)
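
The pattern noted above, where validation loss falls and then curves back up, is what a patience-based early stopping rule is meant to catch. The sketch below shows the idea in plain Python. The function name `early_stopping_epoch` and the loss values are made up for illustration; Keras provides the same idea through `keras.callbacks.EarlyStopping` with its `patience` and `restore_best_weights` options.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience-based rule:
    stop once the validation loss has failed to improve by more than
    min_delta for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through the last epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation curve that improves, then turns upward
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, keep weights from epoch {best_epoch}")
```

Keeping the weights from `best_epoch` rather than from the stopping epoch is the reason to restore the best weights. A widening gap between training and validation loss after that point is a typical sign of overfitting, which would be consistent with the small dataset.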

davidvenuto commented 1 year ago

Goals for 7/3-7/9

[x] Finish development of first NN on Sea Ice data. Model is already created but needs to be optimized
[x] Once finished with first NN, move on to an RNN, compare results
[ ] Read more about LSTM, start thinking about architecture for specific implementation
[x] Understand what needs to be done AFTER the model is created. What do we now do with the time series we've created?
[x] Enjoy July 4th

davidvenuto / Summer-2023-LDEO-David-Venuto-Repository

Summer 2023: Goals and Plans for the Summer #1
