fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

Time Series Forecasting #66

Open soroush-ziaeinejad opened 1 year ago

soroush-ziaeinejad commented 1 year ago

@Sharjeeliv

Hello Sharjeel, this issue page is created for the task of time series forecasting for user similarity matrices. Here is the detailed explanation of the task: when UML (the user modeling layer) is done, we have T different U*U matrices that keep the similarity between U users for T time intervals. Suppose we have 60 time intervals and 1000 users; then we have 60 matrices, each of shape 1000*1000. Your task is to predict the user similarity matrix for time interval T+1 (61 in our case).
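For concreteness, here is a minimal sketch of the setup (shapes are illustrative, random data stands in for the real similarities, and I use a smaller U just to keep the sketch light):

import numpy as np

T, U = 60, 100  # in our real setting U = 1000
# sims[t] holds the U*U user-similarity matrix for time interval t+1
sims = np.random.rand(T, U, U)
# goal: given all T observed matrices, forecast the unseen U*U matrix for interval T+1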

Motivation: Our current approach is to generate T graphs from these T matrices (we treat them as adjacency matrices) and then build T embedded matrices of shape U*d (d = embedding dimension), such that matrix t is calculated based on matrices 1 to t-1. This is called temporal graph embedding, and it is done in the graph embedding layer (GEL) of SEERa. The main problems with this approach are:

  1. Generating graphs from the adjacency matrices is time consuming.
  2. Saving the generated graphs in one 3D matrix (T*U*U) is memory and time consuming.
  3. SEERa is going to be a one-stop-shop framework, so we want to add different variations and approaches (baselines) for each task for the sake of comparison and comprehensiveness.
  4. We currently use only the latest embedded matrix; going over all the generated embedded matrices may improve model performance.

Your subtasks:

  1. Generate toy datasets: please write code that takes T, U, and a scenario as inputs, generates T files each containing a U*U matrix as the user similarity matrix, and saves them in a folder along with their corresponding heatmaps. The scenario can be one of three: (1) increasing similarities, (2) decreasing similarities, or (3) mixed, where some similarities increase while others decrease.
  2. Implement the simplest possible LSTM using PyTorch, train it on the first k toy matrices, predict the (k+1)-th, and calculate the prediction accuracy (or error) using the real (k+1)-th matrix as the ground truth (a possible error metric is sketched after this list). This way, we can gradually see the performance of the LSTM model.
  3. Gradually extend the model (in terms of parameters and the number of layers and neurons) to see the effect of each parameter and achieve the best model for this task.
  4. Train the model on our original dataset (which I will provide later).
  5. Tune the model on our dataset to achieve the best results.
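For subtask 2, the prediction error could be, for example, the mean squared error between the predicted and ground-truth matrices; a minimal sketch (the exact metric is up for discussion):

import numpy as np

def prediction_error(pred: np.ndarray, truth: np.ndarray) -> float:
    # Mean squared error over all U*U entries of the similarity matrix
    return float(np.mean((pred - truth) ** 2))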

Please start with subtask 1 (toy dataset generation), and we will set up regular meetings to discuss your findings and make changes if required. If you have any questions about your first subtask (or the whole task), please do not hesitate to contact me or ask for help.

@hosseinfani, please share your comments and thoughts on this task. It would be greatly appreciated.

Thanks in advance!

soroush-ziaeinejad commented 1 year ago

Hey @Sharjeeliv ,

Any updates on the toy dataset generation? Please let me know once this task is finished. By the way, I forgot to mention: please keep the user similarities between 0 and 1.

Thanks :)

Sharjeeliv commented 1 year ago

Hi Soroush, I'm still working on the task (needed to spend time reviewing SEERa's code again). A few initial questions:

I will have more consistent and frequent updates starting next week.

soroush-ziaeinejad commented 1 year ago

Hi @Sharjeeliv

Yes, reviewing SEERa's code will definitely help. However, the first steps of your task are not directly related to SEERa. Let's focus on your first subtask, toy dataset generation. We need data to test the models that we are going to implement in SEERa, and we want to generate it in this step.

A desired file structure for the data can be like this:

|- data.toy
|--- scenario.1
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
|--- scenario.2
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
|--- scenario.3
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
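Each saved file can then be loaded back directly with NumPy, e.g.:

import numpy as np

day1 = np.load('data.toy/scenario.1/day.001.npy')  # the U*U similarity matrix for day 1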

And this is how the data (user similarity matrices) should look for 3 users (U=3) under scenario.1 (increasing):

day.1.npy = [[1, 0.1, 0.2], [0.1, 1, 0.3], [0.2, 0.3, 1]]

day.2.npy = [[1, 0.3, 0.4], [0.3, 1, 0.5], [0.4, 0.5, 1]]

day.3.npy = [[1, 0.5, 0.6], [0.5, 1, 0.7], [0.6, 0.7, 1]]

day.4.npy = [[1, 0.8, 0.9], [0.8, 1, 1], [0.9, 1, 1]]

Your task is to write a Python function that takes T, U, and a scenario number, generates T different user similarity matrices, and saves them as T .npy files. Please let me know if you need more clarification.

Sharjeeliv commented 1 year ago

Perfect, thank you. I was confused about how task 1 was linked to SEERa's code; this clears it up. I'll have this done ASAP.

soroush-ziaeinejad commented 1 year ago

Thanks @Sharjeeliv

Also, you can use the simplest scheme to increase and decrease user similarities: generate a random matrix for the first day, then increase each entry by (1 - initial_value)/T per day. You can use any other approach you prefer; this is just a suggestion.
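A minimal sketch of that suggestion (values are illustrative):

import numpy as np

T, U = 30, 5
sim = np.random.uniform(0, 1, size=(U, U))  # day 1: random similarities in [0, 1]
step = (1.0 - sim) / T                      # fixed per-day increment, as suggested above
for day in range(2, T + 1):
    sim = np.clip(sim + step, 0.0, 1.0)     # move toward 1, clipped to the valid range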

soroush-ziaeinejad commented 1 year ago

Hello @Sharjeeliv

I am writing to kindly request your prompt attention to the task at hand. We are in need of the output of your first subtask (toy dataset) in order to work on another issue (#71). If you could complete the task as soon as possible, it would be greatly appreciated. Please keep me posted.

Thank you for your cooperation.

Sharjeeliv commented 1 year ago

Yep, I am almost done; just working out some kinks. It will be complete and posted by tonight.

Sharjeeliv commented 1 year ago

Hi @soroush-ziaeinejad,

This is the completed task one. Since it's not part of SEERa, I posted the code here, but I can make a PR with just this file if everything looks good to you. Also, for scenario 3, I split the user pairs and incremented half and decremented the other half.

This is how the saved files and the generated dataset look:

[screenshots: the saved .npy files and a generated dataset]

import os

import numpy as np
from math import ceil

ROUNDING_FACTOR: int = 2

def generate_dataset(time_interval: int, users: int, scenario: int):
    dataset = np.round(np.random.uniform(0.0, 1.0, size=(users, users)), ROUNDING_FACTOR)
    mask = inc_and_dec_scenario_mask(users)
    save_dataset(dataset, scenario, 1)
    print("Original dataset:\n", dataset, "\n")

    for day in range(2, time_interval + 1):  # The loop begins at day 2, as day 1 is the random dataset
        temp = generate_dataset_change(dataset, day)
        # print(f"Change on iteration {day}:\n", temp, "\n")

        if scenario == 1:
            dataset += temp
        elif scenario == 2:
            dataset -= temp
        elif scenario == 3:
            dataset += temp * mask
        else:
            print("Scenario entry must be between 1-3")
            return

        dataset = dataset.clip(0, 1)  # Value must be between 0 and 1
        save_dataset(dataset, scenario, day)
        # print(f"New dataset on iteration {day}:\n", dataset, "\n")

def generate_dataset_change(np_array: np.ndarray, day: int) -> np.ndarray:
    # Daily change: a fraction of each entry's remaining distance to 1
    return np.round((1 - np_array) / day, ROUNDING_FACTOR)

def inc_and_dec_scenario_mask(users: int) -> np.ndarray:
    # Scenario 3 mask: +1 for the first half of the entries, -1 for the rest
    size = users * users
    temp = np.ones(size)
    temp[ceil(size / 2):] *= -1
    return temp.reshape([users, users])

def save_dataset(dataset: np.ndarray, scenario: int, day: int):
    dest = f'../data.toy/scenario.{scenario}'
    os.makedirs(dest, exist_ok=True)  # Create the scenario folder on first use
    np.save(dest + f'/day.{day:03d}', dataset)

if __name__ == '__main__':
    generate_dataset(3, 3, 2)
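For the heatmaps mentioned in the spec, I could later add something like this (a sketch using matplotlib; not yet wired into the code above):

import matplotlib.pyplot as plt

def save_heatmap(dataset, scenario: int, day: int):
    # Render the similarity matrix as a heatmap next to its .npy file
    plt.imshow(dataset, cmap='viridis', vmin=0, vmax=1)
    plt.colorbar()
    plt.savefig(f'../data.toy/scenario.{scenario}/day.{day:03d}.png')
    plt.close()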
soroush-ziaeinejad commented 1 year ago

Hey @Sharjeeliv

Thank you for the clean and clear code! I just tested your code and it is exactly what I asked for. It seems we are done with the first subtask. I will provide a description for the second one tomorrow so you can start working on it.

soroush-ziaeinejad commented 1 year ago

@Sharjeeliv ,

There is a bug in the code: these matrices are user similarity matrices and should be symmetric. I think if you initialize them as symmetric matrices, they will remain symmetric until the end. Can you please address this and update the code?

Thank you :)

Sharjeeliv commented 1 year ago

@soroush-ziaeinejad,

I changed the generation as shown below; this is a quick fix until I can figure out how to natively generate a symmetric matrix. How would you expect scenario 3 to look, since the entries cannot be split evenly in half when their count is odd?

[screenshot: updated matrix-generation code]

Scenario 3 - Current

[screenshot: scenario 3 output]
soroush-ziaeinejad commented 1 year ago

Thank you @Sharjeeliv

Do you mean splitting them in "half"? If that is the problem, it's not a strict half; I expect increasing similarity for some users and decreasing for others.

By the way, please keep the diagonal of all matrices as 1, because the similarity between a user and herself is always 1 for us. Sorry, I forgot to mention it before.
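For reference, NumPy has a one-liner for this:

import numpy as np

matrix = np.random.rand(3, 3)
np.fill_diagonal(matrix, 1.0)  # self-similarity is always 1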

Please share the code when it's finished.

Sharjeeliv commented 1 year ago

@soroush-ziaeinejad,

So in scenario 3, is it okay if the matrix is no longer symmetric?

I implemented the changes; here is the new code. Please let me know if anything else needs to be changed.

This is how a generated matrix looks:

[screenshot: generated symmetric matrix]
import os

import numpy as np
from math import ceil

ROUNDING_FACTOR: int = 2

def generate_dataset(time_interval: int, users: int, scenario: int):
    dataset = generate_user_similarity_matrix(users)
    mask = inc_and_dec_scenario_mask(users)
    save_dataset(dataset, scenario, 1)
    print("Original dataset:\n", dataset, "\n")

    for day in range(2, time_interval + 1):  # The loop begins at day 2, as day 1 is the random dataset
        temp = generate_dataset_change(dataset, day)
        # print(f"Change on iteration {day}:\n", temp, "\n")

        if scenario == 1:
            dataset += temp
        elif scenario == 2:
            dataset -= temp
        elif scenario == 3:
            dataset += temp * mask
        else:
            print("Scenario entry must be between 1-3")
            return

        dataset = dataset.clip(0, 1)  # Value must be between 0 and 1
        save_dataset(dataset, scenario, day)
        print(f"New dataset on iteration {day}:\n", dataset, "\n")

def generate_user_similarity_matrix(users: int) -> np.ndarray:
    dataset = symmetrize(np.random.uniform(0.0, 1.0, size=(users, users)))
    np.fill_diagonal(dataset, 1)  # A User is similar to themselves
    return np.round(dataset, ROUNDING_FACTOR)

def symmetrize(np_array: np.ndarray) -> np.ndarray:
    # Average the matrix with its transpose to make it symmetric
    return (np_array + np_array.transpose()) / 2

def generate_dataset_change(np_array: np.ndarray, day: int) -> np.ndarray:
    # Daily change: a fraction of each entry's remaining distance to 1
    return np.round((1 - np_array) / day, ROUNDING_FACTOR)

def inc_and_dec_scenario_mask(users: int) -> np.ndarray:
    # Scenario 3 mask: +1 for the first half of the entries, -1 for the rest
    size = users * users
    temp = np.ones(size)
    temp[ceil(size / 2):] *= -1
    return temp.reshape([users, users])

def save_dataset(dataset: np.ndarray, scenario: int, day: int):
    dest = f'../data.toy/scenario.{scenario}'
    os.makedirs(dest, exist_ok=True)  # Create the scenario folder on first use
    np.save(dest + f'/day.{day:03d}', dataset)

if __name__ == '__main__':
    generate_dataset(3, 3, 1)
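A quick sanity check for the symmetry and diagonal requirements (illustrative, for scenario 1):

import numpy as np

m = np.load('../data.toy/scenario.1/day.001.npy')
assert np.allclose(m, m.T), 'similarity matrix should be symmetric'
assert np.allclose(np.diag(m), 1.0), 'self-similarity should be 1'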
soroush-ziaeinejad commented 1 year ago

@Sharjeeliv


Quoting the second subtask from the task description above:

> Implement the simplest possible LSTM using PyTorch, train it on the first k toy matrices, predict the (k+1)-th, and calculate the prediction accuracy (or error) using the real (k+1)-th matrix as the ground truth. This way, we can gradually see the performance of the LSTM model.

As we move forward, I would like you to focus on this second subtask. You can use the code you have already written to generate user similarity matrices for a period of 30 days for 1000 users across all scenarios. This generated dataset will be your starting point.

Next, I would like you to implement a basic LSTM model with a single hidden layer using PyTorch. Train the model on the dataset, then predict the user similarity matrix for the upcoming time interval (the 31st day in this case).
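A rough sketch of what I mean (names and sizes are illustrative; note that flattening a 1000*1000 matrix gives 10^6 input features per time step, so you may want to start with a smaller U):

import torch
import torch.nn as nn

class SimilarityLSTM(nn.Module):
    # A single hidden LSTM layer over flattened U*U similarity matrices
    def __init__(self, n_users: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_users * n_users, hidden_size=hidden_dim, num_layers=1)
        self.head = nn.Linear(hidden_dim, n_users * n_users)  # map back to a flat matrix

    def forward(self, x):  # x: (seq_len, batch, n_users * n_users)
        out, _ = self.lstm(x)
        return self.head(out[-1])  # predict the next interval from the last hidden state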

Please note that I do not expect you to complete this entire subtask within a week; steady progress is what matters most. Take your time to familiarize yourself with the concepts of neural networks and LSTMs if you are not already familiar with them. Your findings and progress updates are important to us, so please keep us informed.

Please let me know if you need help or face any issues during this step.

Thanks

Sharjeeliv commented 1 year ago

Since I'm not familiar with any of these topics in depth, I'll spend the weekend studying them and learning PyTorch. I have found a tutorial on LSTM models with PyTorch, so I'll start with that soon after.

Sharjeeliv commented 1 year ago

Hi Soroush, I'll need to revise my original statement; it's going to take me longer than a weekend. I found a good course from Facebook that covers an introduction to neural networks with a focus on recurrent networks (it has a section on LSTMs) and PyTorch. I'm working through the course and doing its PyTorch and NN practice exercises.

Course: https://www.udacity.com/course/deep-learning-pytorch--ud188

soroush-ziaeinejad commented 1 year ago

Hi @Sharjeeliv ,

Thanks for the update. That's perfect; this course will definitely help improve your knowledge of NNs and PyTorch. However, we usually don't need the deep theory behind NNs and their variations. Whenever you feel you are dealing with too much information, it probably means you are focusing on theoretical details that can be skipped at this step.

Please do not hesitate to let us know if you face any issues or questions. We will try our best to help.

Have an adventurous journey into the NN world!

soroush-ziaeinejad commented 1 year ago

Hi @Sharjeeliv ,

I know this task may be taking longer than the previous one, but I want to encourage you to keep track of and report any small progress you make. Reporting your activities, even when you haven't achieved the desired results, is an essential skill that will benefit you greatly. Don't hesitate to share your understanding, every small experiment you've tried, or any obstacles you've faced. Please keep us updated on your progress, and let us know how we can help.

Thanks :)

Sharjeeliv commented 1 year ago

Whoops, I will post updates more frequently now. This is what I've done and am doing so far, and my current understanding:

For each of the topics, I've been taking notes, doing practice coding, writing out formulas, etc. I also ended up learning LaTeX, because writing math notes is awful otherwise.

The main issue I've had is balancing everything with my assignments and courses, so I've been making progress more slowly than I would like, but it's been steady, and SEERa is making much more sense to me now. I intend to finish the PyTorch section before Monday and take a crack at the task; this course has a topic on LSTMs and a section on time-series forecasting, so worst case I'll be able to do the task as I finish that section.

soroush-ziaeinejad commented 1 year ago

@Sharjeeliv

Awesome! This is a perfect example of a desired progress report. Please keep going and keep us updated on your next steps. Thank you :)

Sharjeeliv commented 1 year ago

A short pre-update: I'm finishing up what I set out in my last update. Since I mostly make progress on the weekend I'll be *giving my updates every Monday.

*trying to

Sharjeeliv commented 1 year ago

An update on my progress; this content was slower to get through than the earlier sections:

The goal is still the same: finish this content and get started on the task. I will probably start on the task in parallel, since from the tutorial it seems to be mostly an application exercise.

soroush-ziaeinejad commented 1 year ago

It seems you are getting ready for hands-on experience with deep neural networks! Does the course contain any coding assignments or practice materials? @Sharjeeliv

Sharjeeliv commented 1 year ago

Updated for brevity and accuracy

Yep, the course has several coding assignments focused on implementing various models (default feedforward networks, CNNs, RNNs, etc.) with PyTorch.

As for an update, I’ve finished the core theory, particularly the following topics.

This is a brief summary of the core topics I’ve covered. I’m currently working on the hands-on PyTorch section.

Sharjeeliv commented 1 year ago

Update 1/3 - PyTorch

The following is a final compilation of key points learned or accomplished during the course of the task.

The following are key points learned from the PyTorch course section:

Sharjeeliv commented 1 year ago

Update 2/3 - RNN & LSTM

The following are key points learned from the RNN course section:

Sharjeeliv commented 1 year ago

Update 3/3 - LSTM Draft Model

This is a refactored and significantly cleaned-up model; the output is still far off from what is expected. Although I was able to understand certain parts of building the model (those covered in the course), other parts, like the LSTM hidden and cell states, were confusing. Getting the proper shapes and arrangement also ended up being trial and error. The training and testing were familiar but still time-consuming to get working.

import torch
import torch.nn as nn
import numpy as np

PATH = '/Users/sharjeelmustafa/Documents/02 Work/01 Research/Y3-22F/SEERa/data.toy/scenario.1/day.'

def get_input(t: int):
    sequence = []
    for i in range(1, t + 1):
        data = np.load(f"{PATH}{i:03d}.npy").astype(np.float32)
        # print(f"{torch.tensor(data).view(-1)}\n")
        sequence.append(torch.tensor(data).view(-1))
    return torch.stack(sequence).unsqueeze(1)

if __name__ == '__main__':

    t = 8
    input_dim, n_layers, batch_size, hidden_dim = 9, 1, 1, 9  # Extensive trial and error needed
    num_epochs = 500  # Epochs

    # Define the model (shorthand: a raw nn.LSTM module rather than a custom class)
    lstm = nn.LSTM(input_dim, hidden_dim, n_layers)
    inputs = get_input(t)
    hidden_state = torch.randn(n_layers, batch_size, hidden_dim)
    cell_state = torch.randn(n_layers, batch_size, hidden_dim)

    criterion = nn.MSELoss()  # Error loss function
    optimizer = torch.optim.Adam(lstm.parameters(), lr=0.0001)

    for epoch in range(num_epochs):
        # Prepare for training
        lstm.train()
        optimizer.zero_grad()

        # Detach the states so gradients don't flow across epochs (needed to do this, otherwise it crashed)
        hidden_state = hidden_state.detach()
        cell_state = cell_state.detach()

        # Forward pass - Still unclear on this
        out, hidden = lstm(inputs, (hidden_state, cell_state))
        hidden_state, cell_state = hidden

        # Computing the loss
        # Ground truth: the real day t+1 matrix, flattened to match the LSTM output shape
        target = torch.tensor(np.load(f"{PATH}{t + 1:03d}.npy").astype(np.float32)).view(1, -1)
        loss = criterion(out[-1, :, :], target)

        # The backpropagation and updating step
        loss.backward()
        optimizer.step()

        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}")

    # Switch the model to inference (evaluation) mode
    lstm.eval()
    out, _ = lstm(inputs, (hidden_state, cell_state))
    final_output = out[-1, :, :].view(3, 3)  # Reshape the flat 9-vector back to a 3*3 matrix
    print(f"Prediction\n{final_output}")