soroush-ziaeinejad opened this issue 1 year ago
Hey @Sharjeeliv ,
Any updates on the toy dataset generation? Please let me know once this task is finished. By the way, I forgot to mention: please keep the user similarities between 0 and 1.
Thanks :)
Hi Soroush, I'm still working on the task (needed to spend time reviewing SEERa's code again). A few initial questions:
I will have more consistent and frequent updates starting next week.
Hi @Sharjeeliv
Yes, reviewing SEERa's code will definitely help. However, the first steps of your task are not directly related to SEERa. Let's focus on your first subtask, which is toy dataset generation. We need data to trace the models we are going to implement in SEERa, and we want to generate it in this step.
A desired file structure for the data can be like this:
|- data.toy
|--- scenario.1
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
|--- scenario.2
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
|--- scenario.3
|------ day.001.npy
|------ day.002.npy
|------ ...
|------ day.T.npy
And this is how the data (user similarity matrices) should look for 3 users (U=3) for scenario.1 (increasing):
day.1.npy = [[1, 0.1, 0.2],
             [0.1, 1, 0.3],
             [0.2, 0.3, 1]]
day.2.npy = [[1, 0.3, 0.4],
             [0.3, 1, 0.5],
             [0.4, 0.5, 1]]
day.3.npy = [[1, 0.5, 0.6],
             [0.5, 1, 0.7],
             [0.6, 0.7, 1]]
day.4.npy = [[1, 0.8, 0.9],
             [0.8, 1, 1],
             [0.9, 1, 1]]
Your task is to write a function in Python that takes T, U, and the scenario number, generates T different user similarity matrices, and saves them as T .npy files. Please let me know if you need more clarification.
Perfect, thank you. I was confused about how task 1 was linked to SEERa's code; this helped clear it up. I'll have this done ASAP.
Thanks @Sharjeeliv
Also, you can use the simplest approach to increase and decrease the user similarities: generate a random matrix for the first day, then increase each entry by (1 - initial_value)/T per day. Feel free to use any other approach you prefer; this was just a suggestion.
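For instance, an entry that starts at 0.4 with T = 4 gets a fixed increment of (1 - 0.4)/4 = 0.15, giving 0.4, 0.55, 0.70, 0.85 over the four days. A minimal sketch of this suggested rule (the seed and variable names are just illustrative):

import numpy as np

T, U = 4, 3
rng = np.random.default_rng(0)
sim = rng.uniform(0.0, 1.0, size=(U, U))  # random day-1 similarities
step = (1.0 - sim) / T                    # fixed per-day increment per entry
for day in range(2, T + 1):
    sim = np.clip(sim + step, 0.0, 1.0)   # increasing scenario; decreasing would subtract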
Hello @Sharjeeliv
I am writing to kindly request your prompt attention to the task at hand. We are in need of the output of your first subtask (toy dataset) in order to work on another issue (#71). If you could complete the task as soon as possible, it would be greatly appreciated. Please keep me posted.
Thank you for your cooperation.
Yep, I am almost done; just working out some kinks. It will be complete and posted by tonight.
Hi @soroush-ziaeinejad,
This is the completed task one. Since it's not part of SEERa, I posted the code here, but I can make a PR with just this file if everything looks good to you. Also, for scenario 3, I split the combinations and incremented half and decremented the other half.
This is how the saved files and the dataset look when generated:
import os
import numpy as np
from math import ceil

ROUNDING_FACTOR: int = 2


def generate_dataset(time_interval: int, users: int, scenario: int):
    dataset = np.round(np.random.uniform(0.0, 1.0, size=(users, users)), ROUNDING_FACTOR)
    mask = inc_and_dec_scenario_mask(users)
    save_dataset(dataset, scenario, 1)
    print("Original dataset:\n", dataset, "\n")
    for day in range(2, time_interval + 1):  # The loop begins at day 2, as day 1 is the random dataset
        temp = generate_dataset_change(dataset, day)
        if scenario == 1:
            dataset += temp
        elif scenario == 2:
            dataset -= temp
        elif scenario == 3:
            dataset += temp * mask  # Increment half the entries, decrement the other half
        else:
            print("Scenario entry must be between 1-3")
            return
        dataset = dataset.clip(0, 1)  # Values must stay between 0 and 1
        save_dataset(dataset, scenario, day)


def generate_dataset_change(np_array: np.ndarray, day: int) -> np.ndarray:
    # Daily change: a fraction of each entry's remaining distance to 1
    return np.round((1 - np_array) / day, ROUNDING_FACTOR)


def inc_and_dec_scenario_mask(users: int) -> np.ndarray:
    # +1 for the first half of the entries, -1 for the rest (scenario 3)
    size = users * users
    temp = np.ones(size)
    temp[ceil(size / 2):] *= -1
    return temp.reshape([users, users])


def save_dataset(dataset: np.ndarray, scenario: int, day: int):
    dest = f'../data.toy/scenario.{scenario}'
    os.makedirs(dest, exist_ok=True)  # Create the scenario folder if it doesn't exist
    np.save(dest + f'/day.{day:03d}', dataset)


if __name__ == '__main__':
    generate_dataset(3, 3, 2)
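As a quick check (not part of the submitted file, and assuming the script above was just run with its default arguments), the saved matrices can be reloaded to confirm the values stay in [0, 1]:

import numpy as np

m = np.load('../data.toy/scenario.2/day.003.npy')
assert m.min() >= 0 and m.max() <= 1  # similarities must stay within [0, 1]
print(m)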
Hey @Sharjeeliv
Thank you for the clean and clear code! I just tested your code, and it is exactly what I asked for. It seems we are done with the first subtask. I will provide a description of the second one tomorrow so you can start working on it.
@Sharjeeliv ,
There is a bug in the code: these matrices are user similarity matrices, so they should be symmetric. I think if you initialize symmetric matrices, they will remain symmetric until the end. Can you please address this issue and update the code?
Thank you :)
@soroush-ziaeinejad,
I changed it to generate like this (a quick fix until I can figure out how to natively generate a symmetric matrix). How would you expect it to look for scenario 3, since the matrix cannot be evenly divided in half if it is odd?
Scenario 3 - Current
Thank you @Sharjeeliv
Do you mean splitting them in "half"? If that is the problem, it's not a strict half; I expect increasing similarity for some users and decreasing for others.
By the way, please keep the diagonal of all matrices as 1, because the similarity between a user and herself is always 1 for us. Sorry, I forgot to mention it before.
Please share the code when it's finished.
@soroush-ziaeinejad,
So in scenario three, if the matrix is no longer symmetric, that is OK?
I implemented the changes, here is the new code. Please let me know if anything else needs to be changed.
This is how a generated matrix looks:
import os
import numpy as np
from math import ceil

ROUNDING_FACTOR: int = 2


def generate_dataset(time_interval: int, users: int, scenario: int):
    dataset = generate_user_similarity_matrix(users)
    mask = inc_and_dec_scenario_mask(users)
    save_dataset(dataset, scenario, 1)
    print("Original dataset:\n", dataset, "\n")
    for day in range(2, time_interval + 1):  # The loop begins at day 2, as day 1 is the random dataset
        temp = generate_dataset_change(dataset, day)
        if scenario == 1:
            dataset += temp
        elif scenario == 2:
            dataset -= temp
        elif scenario == 3:
            dataset += temp * mask  # Increment half the entries, decrement the other half
        else:
            print("Scenario entry must be between 1-3")
            return
        dataset = dataset.clip(0, 1)  # Values must stay between 0 and 1
        save_dataset(dataset, scenario, day)
        print(f"New dataset on iteration {day}:\n", dataset, "\n")


def generate_user_similarity_matrix(users: int) -> np.ndarray:
    dataset = symmetrize(np.random.uniform(0.0, 1.0, size=(users, users)))
    np.fill_diagonal(dataset, 1)  # A user is always fully similar to themselves
    return np.round(dataset, ROUNDING_FACTOR)


def symmetrize(np_array: np.ndarray) -> np.ndarray:
    # Average the matrix with its transpose to force symmetry
    return (np_array + np_array.transpose()) / 2


def generate_dataset_change(np_array: np.ndarray, day: int) -> np.ndarray:
    # Daily change: a fraction of each entry's remaining distance to 1
    return np.round((1 - np_array) / day, ROUNDING_FACTOR)


def inc_and_dec_scenario_mask(users: int) -> np.ndarray:
    # +1 for the first half of the entries, -1 for the rest (scenario 3)
    size = users * users
    temp = np.ones(size)
    temp[ceil(size / 2):] *= -1
    return temp.reshape([users, users])


def save_dataset(dataset: np.ndarray, scenario: int, day: int):
    dest = f'../data.toy/scenario.{scenario}'
    os.makedirs(dest, exist_ok=True)  # Create the scenario folder if it doesn't exist
    np.save(dest + f'/day.{day:03d}', dataset)


if __name__ == '__main__':
    generate_dataset(3, 3, 1)
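As a quick sanity check (illustrative, assuming the default output path above), a saved matrix can be reloaded and verified for symmetry and a unit diagonal, the two requirements discussed here:

import numpy as np

m = np.load('../data.toy/scenario.1/day.002.npy')
assert np.allclose(m, m.T), "the similarity matrix should be symmetric"
assert np.allclose(np.diag(m), 1), "self-similarity should stay 1"
print("OK:", m.shape)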
@Sharjeeliv
Hello Sharjeel, this issue page is created for the task of time series forecasting for user similarity matrices. Here is the detailed explanation for this task: When UML (user modeling layer) is done, we have T different U*U matrices that keep the similarity between U users for T time intervals. Suppose we have 60 time intervals and 1000 users. So we have 60 matrices with shape 1000*1000. Your task is to predict the user similarity matrix for time interval T+1 (61 in our case).
- implement the simplest possible LSTM using PyTorch, train it on the first k toy time series, predict the (k+1)th, and calculate the prediction accuracy (or error) using the real (k+1)th matrix as the ground truth. In this way, we can gradually see the performance of the LSTM model.
@Sharjeeliv As we move forward, I would like you to focus on the second subtask. You can make use of the code you have already created to generate user similarity matrices for a period of 30 days for 1000 users across all scenarios. This generated dataset will be your starting point.
Next, I would like you to implement a basic LSTM model with a single hidden layer using PyTorch. Train the model on this dataset, and then predict the user similarity matrix for the upcoming time interval (the 31st day in this case).
Please note that I do not expect you to complete this entire subtask within a week; steady progress is what matters most. Take your time to familiarize yourself with the concepts of neural networks and LSTMs if you are not already familiar with them. Your findings and progress updates are important to us, so please keep us informed of your progress.
Please let me know if you need help or face any issues during this step.
Thanks
Since I'm not familiar with any of the topics in depth, I'll spend the weekend studying them and learning PyTorch. I have found a tutorial on LSTM models with PyTorch, so I'll start with that soon after.
Hi Soroush, I'll need to revise my original statement: it's going to take me longer than a weekend. I found a good course from Facebook that covers an introduction to neural networks with a focus on recurrent networks (it has a section on LSTM) and PyTorch. I'm working through the course and doing its PyTorch and NN practice exercises.
Course: https://www.udacity.com/course/deep-learning-pytorch--ud188
Hi @Sharjeeliv ,
Thanks for the update. That's perfect; this course will definitely help improve your knowledge of NNs and PyTorch. However, we usually don't need the deep theory behind NNs and their variations. Whenever you feel you are dealing with too much information, it probably means you are focusing on theoretical details that can be skipped at this step.
Please do not hesitate to let us know if you face any issues or questions. We will try our best to help.
Have an adventurous journey into the NN world!
Hi @Sharjeeliv ,
I know this task may be taking longer than the previous one, but I want to encourage you to keep track of and report any small progress you make. Reporting your activities, even if you haven't achieved the desired results, is an essential skill that will benefit you greatly. Don't hesitate to share your understanding, every small experience or experiment you've tried, or any obstacles you've faced. So please keep us updated on your progress, and let us know how we can help.
Thanks :)
Whoops, I will post updates more frequently now. This is what I've done/am doing so far and my current understanding:
For each of the topics, I've been taking notes, doing practice coding, writing out formulas, etc. I also ended up learning LaTeX, because writing math notes is awful otherwise.
The main issue I've had is balancing everything with my assignments and courses, so I've been making progress more slowly than I would like, but it's been steady, and SEERa is making much more sense to me now. I intend to finish the PyTorch section before Monday and take a crack at the task; this course has a topic on LSTMs and a section on time-series forecasting, so worst case I'll be able to do the task as I finish that section.
@Sharjeeliv
Awesome! This is a perfect sample of a desired report of progress. Please keep going and keep us updated about your new steps. Thank you :)
A short pre-update: I'm finishing up what I set out in my last update. Since I mostly make progress on the weekend I'll be *giving my updates every Monday.
*trying to
An update on my progress; this content was slower to get through than the earlier section:
The goal is still the same: to finish this content and get started on the task. I will probably start on the task in parallel, since from the tutorial it just seems like an application.
It seems you are getting ready for hands-on experience in deep neural networks! Does the course contain any coding assignments or practice materials? @Sharjeeliv
Updated for brevity and accuracy
Yep, the course has several coding assignments focused on implementing various models (Default, CNN, RNN, etc.) with PyTorch.
As for an update, I’ve finished the core theory, particularly the following topics.
This is a brief summary of the core topics I’ve covered. I’m currently working on the hands-on PyTorch section.
Update 1/3 - PyTorch
The following is a final compilation of key points learned or accomplished during the course of the task.
The following are key points learned from the PyTorch course section:
Update 2/3 - RNN & LSTM
The following are key points learned from the RNN course section:
Update 3/3 - LSTM Draft Model
This is a refactored and significantly cleaned-up model; the output is still far from what is expected. Although I was able to understand certain parts of building the model (those covered in the course), other parts, like the LSTM hidden and cell states, were confusing. Getting the proper shapes and arrangement also ended up being trial and error. The training and testing were familiar but still time-consuming to get working.
import torch
import torch.nn as nn
import numpy as np

PATH = '/Users/sharjeelmustafa/Documents/02 Work/01 Research/Y3-22F/SEERa/data.toy/scenario.1/day.'


def get_input(t: int):
    # Load days 1..t and flatten each U x U matrix into a vector
    sequence = []
    for i in range(1, t + 1):
        data = np.load(f"{PATH}{i:03d}.npy").astype(np.float32)
        sequence.append(torch.tensor(data).view(-1))
    return torch.stack(sequence).unsqueeze(1)  # shape: (t, batch=1, U*U)


if __name__ == '__main__':
    t = 8
    input_dim, n_layers, batch_size, hidden_dim = 9, 1, 1, 9  # 9 = U*U for U = 3; extensive trial and error needed
    num_epochs = 500  # Epochs

    # Define model (shorthand: a bare nn.LSTM rather than a custom module)
    lstm = nn.LSTM(input_dim, hidden_dim, n_layers)
    inputs = get_input(t)
    hidden_state = torch.randn(n_layers, batch_size, hidden_dim)
    cell_state = torch.randn(n_layers, batch_size, hidden_dim)

    criterion = nn.MSELoss()  # Mean-squared-error loss
    optimizer = torch.optim.Adam(lstm.parameters(), lr=0.0001)

    for epoch in range(num_epochs):
        # Prepare for training
        lstm.train()
        optimizer.zero_grad()

        # Detach the states from the previous iteration's graph; without this it crashed
        hidden_state = hidden_state.detach()
        cell_state = cell_state.detach()

        # Forward pass (still unclear on this part)
        out, (hidden_state, cell_state) = lstm(inputs, (hidden_state, cell_state))

        # Compute the loss against the real day t+1 matrix as ground truth
        target = torch.tensor(np.load(f"{PATH}{t + 1:03d}.npy").astype(np.float32)).view(-1)
        loss = criterion(out[-1, :, :], target)

        # Backpropagation and parameter update
        loss.backward()
        optimizer.step()
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}")

    # Switch the model to inference (evaluation) mode and predict day t+1
    lstm.eval()
    out, _ = lstm(inputs, (hidden_state, cell_state))
    final_output = out[-1, :, :].view(3, 3)
    print(f"Prediction\n{final_output}")
@Sharjeeliv
Hello Sharjeel, this issue page is created for the task of time series forecasting for user similarity matrices. Here is the detailed explanation for this task: When UML (user modeling layer) is done, we have T different U*U matrices that keep the similarity between U users for T time intervals. Suppose we have 60 time intervals and 1000 users. So we have 60 matrices with shape 1000*1000. Your task is to predict the user similarity matrix for time interval T+1 (61 in our case).
Motivation: Our current approach is to generate T graphs from these T matrices (we treat the matrices as adjacency matrices), and then we build T embedded matrices with shape U*d (d = embedding_dimension) such that matrix t is calculated based on matrices 1 to t-1. This is called temporal graph embedding, which is done in the graph embedding layer (GEL) in SEERa. The main problems with this approach are:
Your subtasks:
Please start with subtask 1 (toy dataset generating) and we will set regular meetings to discuss your findings and make changes if required. If you have any questions about your first subtask (or the whole task) please do not hesitate to contact me or ask for help.
@hosseinfani, please share your comments and thoughts on this task. It would be really appreciated.
Thanks in advance!