GDS-Education-Community-of-Practice / DSECOP

This repository contains data science educational materials developed by DSECOP Fellows.
Creative Commons Zero v1.0 Universal

Connor - Time_Series_Analysis_and_Forecasting #23

Closed cnrrobertson closed 1 year ago

cnrrobertson commented 1 year ago

Module under folder Time_Series_Analysis_and_Forecasting

daleas0120 commented 1 year ago

N1:

################################################################################
# Initial setup
################################################################################

import numpy as np

# Setup gravity force function f_g(t,u) where u = (v_x, v_y, p_x, p_y)
g = -9.8 # m/s^2
def f_g(t, u):
    return np.array([0, g, u[0], u[1]])

# Setup initial conditions (velocity (100,100) m/s at position (0,0))
u0 = np.array([100, 100, 0, 0])

# Start at time 0 and go to time 10, collect every Δt = 0.1
t0 = 0
t1 = 200
dt = .1
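For reference, a minimal way to run this setup end to end (a sketch using scipy.integrate.solve_ivp on the variables above; the notebook's own integration cell may differ):

from scipy.integrate import solve_ivp

# Integrate f_g from t0 to t1, sampling every dt
n_steps = round((t1 - t0) / dt)
ts = np.linspace(t0, t1, n_steps + 1)
sol = solve_ivp(f_g, (t0, t1), u0, t_eval=ts)
vx, vy, px, py = sol.y  # velocities and positions over time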

I think you mean to set t1=10?

You might want to consider breaking this code cell into several smaller cells and converting the inline comments to Jupyter notebook markdown. I think it might help the students slow down and think about each portion of the code if they have to execute multiple cells to get a result, especially since a student is not required to alter this code block to make it run. Another thing that might help students slow down and think is asking them to uncomment lines of code to make the cell run correctly.

The units on the color bar for Figure 1 also don't make sense to me. If the code simulates from t=0 to t=200, why does the colorbar axis only go to 20? A brief explanation of how you are handling the units would be helpful.

"If we set the angle to 45, the initial velocity to 21, the mass to 1, the drag to .5, the area to .1, and the density to .1, we get landing at roughly x=40."

It may be worth adding a comment that there are other solutions possible.

Now, in reality each of these measurements would be polluted with some amount of noise due to errors in the measurement or some unaccounted-for effect. With this in mind, let's add some Gaussian noise to our measurements relative to their average.

As an instructor, I would want to add one more cell that calculates the uncertainty in the prediction from the model. This can be done fairly easily just by using Monte Carlo sampling for the model parameters and doing a bit of a sensitivity analysis. Maybe it could go into the appendix?
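Something along these lines would do it (a minimal sketch; simulate_landing and the parameter uncertainties below are stand-ins for the notebook's actual model and fit results, using the example values quoted above):

import numpy as np

rng = np.random.default_rng(0)

def simulate_landing(angle, velocity, drag):
    # Stand-in for the notebook's trajectory model; returns landing distance (m)
    theta = np.radians(angle)
    return velocity**2 * np.sin(2 * theta) / 9.8 * (1 - drag)

# Assumed fitted parameters with made-up one-sigma uncertainties: (mean, sigma)
params = {"angle": (45.0, 1.0), "velocity": (21.0, 0.5), "drag": (0.5, 0.05)}

# Draw parameter samples, push them through the model, and report the spread
n = 10_000
samples = {name: rng.normal(mu, sigma, n) for name, (mu, sigma) in params.items()}
landings = simulate_landing(samples["angle"], samples["velocity"], samples["drag"])
print(f"landing distance: {landings.mean():.1f} +/- {landings.std():.1f} m")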


Overall time to completion: about 20 minutes.

daleas0120 commented 1 year ago

N2:

Basic concepts

Some of the basic ideas that are of interest in time series data are the following:

- Trends: Is there a general direction of the data?
- Seasonality: Is there a trend that repeats itself on a fixed schedule?
- Cyclical component: Is there a trend that repeats itself outside of a fixed schedule?
- Irregular variation: Are there unpredictable and erratic variations in the data?
- Autocorrelation: Are observations at a point in time usually similar to observations at a previous point of time (e.g. an observation at time $t$ is always similar to an observation at $t-1$)?

Here, it may be helpful to either embed pictures giving examples of the different patterns or add hyperlinks to examples. I'm only a bit familiar with this type of time series analysis, and the nuance between autocorrelation and seasonality or cyclical components is a little lost on me (isn't there always going to be autocorrelation any time the data has some frequency component?).
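For example, something like the following could generate simple illustrative figures (synthetic series of my own invention, just to show the flavor):

import numpy as np
import matplotlib.pyplot as plt

t = np.arange(200)
rng = np.random.default_rng(1)
examples = {
    "Trend": 0.05 * t,
    "Seasonality (fixed period)": np.sin(2 * np.pi * t / 20),
    "Cyclical (drifting period)": np.sin(2 * np.pi * t / (20 + 5 * np.sin(t / 30))),
    "Irregular variation": rng.normal(0, 1, t.size),
}

fig, axes = plt.subplots(len(examples), 1, figsize=(6, 8), sharex=True)
for ax, (name, series) in zip(axes, examples.items()):
    ax.plot(t, series)
    ax.set_title(name)
plt.tight_layout()
plt.show()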

For Problem 1: I struggled with this one. I reused the

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

plt.close('all')
seasonal_decompose(test_launch['Drag coefficient'], model='additive', period=4).plot().suptitle("Additive", y=1)
plt.tight_layout()
plt.show()

code and adjusted the period value from 1 to 20 in increments of one. Some things popped up: first, I can't make the period larger than a certain number, which I think comes from the test_launch = launches[20] code line, but explicit confirmation of this in the notebook would be nice. Also, I was able to get very good periodicity in the Seasonal plot with almost no change in the Trend plot. Would some hints about what to look for help students improve their results?
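For reference, the scan I did was essentially this (a sketch assuming test_launch from the notebook; the mean absolute residual is just my own crude score for comparing periods):

from statsmodels.tsa.seasonal import seasonal_decompose

for period in range(2, 21):  # starting at 2 to keep seasonal_decompose happy
    result = seasonal_decompose(test_launch['Drag coefficient'], model='additive', period=period)
    print(period, result.resid.abs().mean())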

The main difference between them is the resid plot which represents the "residual error," or the error component of the decomposition. Though the size of this error is not important for our current case, we don't want it to have any noticeable pattern, as this would indicate that it is not erratic and unpredictable. For this reason, let's stick with the additive decomposition.

I would suggest moving this discussion of residual error up higher in the notebook, to the first time students encounter it. There are several cells that produce residual error plots without any explanation.

If the variance of the series is constant, the size of the seasonal peaks is constant through time. Let's compare this in our series by looking at the variance of the height and the area over time. This amounts to considering the sample variance $\bar{\sigma}$:

$\bar{\sigma}_k = \frac{1}{k}\sum_{i=2}^k (u - \hat{u}_k)$

for $2 \leq k \leq N$.

The meaning of $u$ is defined earlier to be the mean of the series. But then what is $\hat{u}_k$? I figured out that it is the mean of the first $k$ values, but it may be worth stating this explicitly rather than assuming that students are familiar with the definition of variance.
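For what it's worth, my reading is that

$\hat{u}_k = \frac{1}{k}\sum_{i=1}^{k} u_i,$

i.e. the running mean over the first $k$ observations; spelling that definition out next to the variance formula would remove the ambiguity.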


Time to complete: about 1 hr

I very much like this notebook. You touched on many of the concepts I first learned in my 600-level Random Variables class, and did so in a very intuitive way. Well done. :)

cnrrobertson commented 1 year ago

I've incorporated the changes to notebook 1 except for the parameter uncertainty (just for the sake of time). I may be able to add that next week. Thanks for all the great feedback!

cnrrobertson commented 1 year ago

I've also incorporated the changes recommended in notebook 2. I included a plot illustrating some of the concepts and moved the description of the residual plot up. Thanks for catching my typo with the mean and variance discussion!

daleas0120 commented 1 year ago

N3:

Suggestion: After this code block

import numpy as np
import pandas as pd

# Load our launch data
data_location = "https://raw.githubusercontent.com/GDS-Education-Community-of-Practice/DSECOP/connor_module/Time_Series_Analysis_and_Forecasting/launches.csv"
all_launches = pd.read_csv(data_location, index_col="Time (s)")

# Split into individual launches
split_indices = np.where(all_launches.index[1:] - all_launches.index[0:-1] < 0)[0].tolist() # Find where time decreases (signifies different launch)
split_indices = [0] + split_indices + [all_launches.shape[0]]
launches = [all_launches.iloc[split_indices[i]+1:split_indices[i+1]] for i in range(100)] # One DataFrame per launch

I suggest adding a code cell all_launches.head() so that students are reminded of what data features are actually in the dataset before they get to Problem 2 where they need to use those additional features.

"An feedforward neural network is a function that was designed to mimic biological neural networks. It can be written as simply "

Grammar: "A feedforward (...)"

from tensorflow import keras

# Network input (the notebook defines x earlier; the shape is assumed here for completeness)
x = keras.layers.Input(shape=(2,))

# Pass input x into first layer of size 3 x 2
y_1 = keras.layers.Dense(3,activation="tanh")(x)

# Pass middle or "hidden" layer into output
y   = keras.layers.Dense(1,activation="tanh")(y_1)

I suggest linking to the Keras documentation here for how to "daisy-chain" the layers. To my knowledge, this is one of the only Python libraries where calls use two sets of parentheses: function(function_params)(function_input) rather than function(function_input, function_params). Since you ask students to build their own models later on, it might be worth emphasizing this point.

import numpy as np
from tensorflow import keras

keras.utils.set_random_seed(0)

# Take the first quarter of the data (stationary)
original_distance = test_launch["Distance (m)"].shift()
distance = test_launch["Distance (m)"] - test_launch["Distance (m)"].shift()
quarter_distance = np.array(distance.iloc[1:17])
quarter_height = np.array(test_launch["Height (m)"].iloc[1:17])

# Organize the data for our recurrent neural network
k = 2
distance_in = []
distance_out = []
for i in range(len(quarter_distance)-k):
  # Take k samples at time t_i ... t_{i+k-1}
  distance_in.append(quarter_distance[i:i+k].reshape((k,1)))
  # Get function output at time t_{i+k}
  distance_out.append(quarter_distance[i+k])

distance_in = np.array(distance_in)
distance_out = np.array(distance_out)

I suggest putting this portion into its own code cell, then adding a markdown cell before the remaining code explaining that you are going to create a new model and retrain from scratch. Since transfer learning is a common technique for improving results from a NN, making it clear that you are training a model from scratch on the adjusted data will keep students from forming a fuzzy concept of what is happening in this notebook, especially since you reuse the same architecture twice.

from tqdm.keras import TqdmCallback

# Make simple many to one model (input 2 samples of size 1)
x = keras.layers.Input(shape=(k,1))
y = keras.layers.SimpleRNN(10,activation="tanh", return_sequences=True)(x)
y = keras.layers.SimpleRNN(1,activation="linear", return_sequences=False)(y)
distance_model  = keras.Model(inputs=x,outputs=y)

# Train model
distance_model.compile(
    optimizer = keras.optimizers.Adam(),
    loss = keras.losses.MeanSquaredError()
)
history = distance_model.fit(
    distance_in,
    distance_out,
    batch_size=10,         # The training takes groups of samples (in this case 10 samples at a time)
    epochs=2000,           # The number of times to iterate through our dataset
    validation_split = 0,  # Use 0% of data to check accuracy
    verbose=0,             # Don't print info as it trains
    callbacks=[TqdmCallback(verbose=0)]
)

I also suggest putting the above into its own code cell, to help students find which part of the code is easiest to reuse for their own models.

" (...) the distance prediction go backward!"

Grammar: "prediction goes backwards" or "predictions go backward"

Problem 3: The theme of this task is data manipulation (compared to the rest of the notebook, which is non-linear time series regression), and what students will practice in this exercise is how to parse a Pandas DataFrame. However, this is not information that is explicitly covered elsewhere in the notebook. I am admittedly bad at reading/remembering where things are, and had to go back and forth through the notebook several times to find where test_launch = launches[20] was defined. It feels slightly cumbersome. So as an instructor, I would try to provide at least some of the following information for my students in this notebook:

- Explicitly tell students how many launches are in their dataset, and how long each launch is both in seconds and in number of samples. This is never explained; N2 just states "A large number of these runs" and N3 just shows/explains how to split them up based on the time value.
- Explicitly state why launch 20 is used as the test case (as opposed to the first launch, or the last launch), and why we are ignoring the rest of the data for so long.
- Explicitly state why it looks like the notebook only uses the first 1/2 of the launch time for training the model when the drag force is most visible after this point.
- Explicitly separate out the data manipulation into its own cell and use the ## Markdown to create a notebook section so that students can go back and find where this happened easily for Problem 3.
- Hyperlink to a Pandas data frame cheat sheet so that they can use some of the plotting options and table options as needed.

Suggestion: I love towardsdatascience.com, but since they paywall after a few articles, you might also want to include a free reference in your appendix. I liked this one: https://neptune.ai/blog/time-series-prediction-vs-machine-learning

Suggestion: The way the notebook is structured, it is difficult to go back and rerun cells to see how tweaking one variable changes something else. For example, if I am copy-pasting code and I update k during Problem 3, I can't go back and rerun Problem 1 easily. I would suggest providing additional template code for the programming problems that encourages students to rename their own variables instead of directly copy-pasting, either by providing variable names for them in a template (k becomes my_k_value, distance_model becomes problem_1_model) or by wrapping your earlier code into functions that help keep the workspace tidy; see the sketch below.
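A sketch of the kind of template I have in mind (the function and variable names here are my own suggestions, not from the notebook):

from tensorflow import keras

def build_rnn_model(k, n_hidden=10):
    """Build a fresh many-to-one RNN for k input samples of size 1."""
    x = keras.layers.Input(shape=(k, 1))
    y = keras.layers.SimpleRNN(n_hidden, activation="tanh", return_sequences=True)(x)
    y = keras.layers.SimpleRNN(1, activation="linear", return_sequences=False)(y)
    return keras.Model(inputs=x, outputs=y)

problem_1_model = build_rnn_model(k=2)
problem_3_model = build_rnn_model(k=4)  # changing k here no longer clobbers earlier cells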

Suggestion: You ask the students to tweak model hyperparameters but, other than qualitatively inspecting the plots, don't give them a good quantitative metric for seeing whether their tweaks improve the model. This could easily be added to the plotting code, which they will presumably copy-paste for their own solutions. You may even want to give them a pre-defined plotting function our_plotting_function that they can call, since the point of the notebook is not to have them practice plotting but to practice time series analysis. This would help increase code readability and reuse, and keep workspace variables tidy.
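One possible shape for it (hypothetical; the name our_plotting_function and its signature are just my suggestion): plot data against prediction and report RMSE in the title, so hyperparameter tweaks become directly comparable.

import numpy as np
import matplotlib.pyplot as plt

def our_plotting_function(t, truth, prediction, title=""):
    # Quantitative score: root-mean-square error between data and model
    truth, prediction = np.asarray(truth), np.asarray(prediction)
    rmse = np.sqrt(np.mean((truth - prediction) ** 2))
    plt.plot(t, truth, label="data")
    plt.plot(t, prediction, "--", label="model")
    plt.title(f"{title} (RMSE = {rmse:.3g})")
    plt.xlabel("Time (s)")
    plt.legend()
    plt.show()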


My main comment is that for the first two-thirds or more of the notebook, the failure of the example RNN models to correctly predict the launch data risks demonstrating to students that an RNN in general is not a good choice for "real" time-dependent data analysis. The notebook also leaves it to a student's skill as a data science engineer to get good results by the end of the notebook, when they have never seen this model "work" successfully on launch data. The example you give with the sin(x) data shows increasing the length of the time analyzed as the solution for the RNN model, so removing the time dependence for the launch data later in the notebook is not an intuitive choice for students. Finally, the sin(x) data is periodic, which does a good job of leveraging the recurrent connections, while the launch data is not periodic (so why then is an RNN a better choice than a vanilla FCNN for the launch data? The argument about fewer model parameters due to the "looping" evaporates).

So, I'm going to pull a "Reviewer 2" on you and state that RNNs as presented in this notebook are generally considered to be a poor choice for time series data because the RNN is going to learn a frozen distribution even while considering multiple time steps at once. It should be made clear to the students that the reason the RNN works for the example sin(x) data is because the model learns the entire periodic nature of the data. The launch data (as presented in the example code) requires the RNN to predict two different half-parabolas, one of which it has never seen before. This is fine if you want students to learn how a model can fail to generalize, but if the point is that an RNN can in fact accomplish this task, then this feels like a backwards way to do it. The notebook clearly improves the results by introducing stationarity, but the predicted data in the early part of the notebook is symmetric around the training-testing data split point because the model has only learned one of the two half-parabolas and then mirrored it around the split point.

I think a major improvement would be replacing the sin(x) data with the "parabolic motion without drag" data created in N1. This data is close enough to the programming tasks assigned to the students that they can lift the thought process (such as stationarity) and repeat it fairly closely for the launch data. A second improvement would be teaching students the LSTM neuron architecture rather than an RNN based on the standard neuron: LSTMs have a nice symmetry with N2, since they carry a constant that shifts with the distribution like the moving average of ARMA, in addition to leveraging the multiple time steps of an RNN.

Here is my solution for problem 2:

[image: reviewer's solution plot for Problem 2]

I think your explanation of the RNN structure, loss function, and the way the model weights are updated was excellent. Overall, very well done and I enjoyed working through the material. :)

Total time to complete: 1 hr

cnrrobertson commented 1 year ago

I suggest adding a code cell all_launches.head() so that students are reminded of what data features are actually in the dataset before they get to Problem 2 where they need to use those additional features.

Very good point. Since I did the other notebooks, it is easy to assume that they are already familiar. I added in a little exposition of what the features are.

I suggest linking to Keras documentation here for how to "daisy-chain" the layers. To my knowledge, this is one of the only python libraries where the function arguments use two sets of parenthesis: function(function_params)(function_input) rather than function(function_input, function_params). Since you ask students to build their own models later on, it might be worth emphasizing this point.

Yeah, it's called the "Functional API" for Keras; basically, the Keras layer constructors return Python callables, which I immediately call by passing in a value. Definitely worth linking and clarifying. I thought it made more sense than the usual Sequential() way of doing things, but it is for sure not clear for the students.
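In other words (my own two-step illustration of the same pattern):

from tensorflow import keras

x = keras.layers.Input(shape=(2,))

dense = keras.layers.Dense(3, activation="tanh")  # step 1: construct a callable layer object
y_1 = dense(x)                                    # step 2: call it on a tensor

# ...which is exactly what the notebook's one-liner does:
y_1_again = keras.layers.Dense(3, activation="tanh")(x)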

I suggest putting this portion into its own code cell, then making a markup cell that explains you are going to create a new model and retrain before the remaining code.

I also suggest putting the above into its own code cell, to help students find which part of the code is easiest to reuse for their own models.

Very good point. I've made the adjustments.

Explicitly tell students how many launches are in their dataset, and how long each launch is both in seconds and number of samples. This is never explained; N2 just states "A large number of these runs" and N3 just shows/explains how to split them up based on the time value.

Explicitly state why launch 20 is used as the test case (as opposed to the first launch, or the last launch), and why we are ignoring the rest of the data for so long

These are now included at the start of the notebook.

Explicitly state why it looks like the notebook only uses the first 1/2 of the launch time for training the model when the drag force is most visible after this point.

It's only for an imagined scenario where you only see the beginning of the trajectory and need to calculate the rest. I added a bit to the problem description to clarify.

Explicitly separate out the data manipulation into its own cell and use the ## Markdown to create a notebook section so that students can go back and find where this happened easily for Problem 3

Made this adjustment building off your previous comment.

Hyperlink to a Pandas data frame cheat sheet so that they can use some of the plotting options and table options as needed

Very good idea. I added this.

cnrrobertson commented 1 year ago

My main comment is that for the first two-thirds+ of the notebook, the failure of the example RNN models to correctly predict the launch data risks demonstrating to students that an RNN in general is not a good choice for "real" time-dependent data analysis. The notebook also leaves it to a student's skill as a data science engineer to get good results by the end of the notebook when they have never seen this model "work" successfully on launch data. The example you give with the sin(x) data shows increasing the length of the time analyzed as the solution for the RNN model, so removing the time dependence for the launch data later in the notebook is not an intuitive choice for students. Finally, the sin(x) data is periodic--which does a good job of leveraging the recurrent connections--while the launch data is not periodic (so why then is a RNN a better choice than a vanilla FCNN for the launch data? The argument about fewer model parameters due to the "looping" evaporates).

I think my perspective is first that this acts just as a demonstration of RNNs applied to a simple time series. I actually think that in as simple a case as this, there are a variety of methods that would outperform or at least match an RNN (including a vanilla FNN or a CNN, or even simpler methods; honestly, ARIMA is great in N2). My objective with the early sections of this notebook was to walk them through the considerations, pains, and adjustments needed to get a neural network working, i.e. how to be a good data engineer, which I feel always ends up being the hardest part of getting a neural network working well. But in retrospect, I think you are right that it makes the RNN seem like a bad choice to the students. I've added a note at the start of the forecasting section for projectile data giving context to the challenges.

So, I'm going to pull a "Reviewer 2" on you and state that RNNs as presented in this notebook are generally considered to be a poor choice for time series data because the RNN is going to learn a frozen distribution even while considering multiple time steps at once. It should be made clear to the students that the reason the RNN works for the example sin(x) data is because the model learns the entire periodic nature of the data. The launch data (as presented in the example code) requires the RNN to predict two different half-parabolas, one of which it has never seen before. This is fine if you want students to learn how a model can fail to generalize, but if the point is that an RNN can in fact accomplish this task, then this feels like a backwards way to do it. The notebook clearly improves the results by introducing stationarity, but the predicted data in the early part of the notebook is symmetric around the training-testing data split point because the model has only learned one of the two half-parabolas and then mirrored it around the split point.

It is not clear to me that the success of the RNN on the sin(t)cos(t) data is due to its periodicity. Ultimately, RNNs really boil down to having a second input which is some processed version of the previous input (which could theoretically be accomplished by inputting multiple previous steps into an FNN). Though the example I gave is a poor demonstration of the reduced number of model parameters, that is a key strength of RNNs. It is also a poor example to train the RNN only on increasing data and then expect it to get the second half of the parabola correct (hence the stationarity improvement, and definitely the improvement in Problem 3; the half-parabola mirroring could also be argued for the classical ARIMA setting). But hopefully the students get a sense of seeing an RNN in action. Even if it's not the optimal setting, I hope they at least understand the application!

I think a major improvement would be replacing the sin(x) data with the "parabolic motion without drag" data created in N1. This data is close enough to the programming tasks assigned to the students that they can lift the thought process (such as stationarity) and repeat it fairly closely for the launch data. A second improvement would be teaching students the LSTM neuron architecture rather than an RNN based on the standard neuron: LSTMs have a nice symmetry with N2 since it has a constant that shifts with the distribution like the moving average of ARMA, in addition to leveraging the multiple time steps of an RNN.

I think using the projectile motion without drag would definitely give better results due to the symmetry of the parabola, but I think that would be a misleading success for the RNN. I also think the stationarity example and the data loading with drag are sufficient for them to replicate. For the LSTM, I thought about adding it but found there was actually minimal improvement over the SimpleRNN (with adjusted parameters... too many parameters in neural networks...). Since I didn't really describe the LSTM architecture and it's definitely more complicated than they need to know, I included a link to its description and Keras function in the appendix, along with the GRU architecture. I presented them as more complicated but improved versions of the vanilla RNN.
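For anyone curious, the swap itself is mechanical on the same functional API pattern (a sketch for illustration, not the notebook's code):

from tensorflow import keras

# Same many-to-one architecture with LSTM cells in place of SimpleRNN
k = 2  # number of input samples, as in the notebook's example
x = keras.layers.Input(shape=(k, 1))
y = keras.layers.LSTM(10, activation="tanh", return_sequences=True)(x)
y = keras.layers.LSTM(1, activation="linear", return_sequences=False)(y)
lstm_model = keras.Model(inputs=x, outputs=y)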