dhopp1 / nowcast_lstm

LSTM neural networks for nowcasting economic data.
MIT License

missing values in mix frequencies data #7

Closed AZFARHAD24511 closed 9 months ago

AZFARHAD24511 commented 10 months ago

In the context of LSTM applied to mixed-frequency data, the initial step involves addressing missing data. This is particularly crucial when dealing with a target variable that follows a monthly frequency, while the features are recorded daily. Before any computations take place, it becomes necessary to impute the missing data in the target variable. To illustrate, consider a scenario where, within a one-month timeframe, there exist 29 instances of missing data. My specific inquiry pertains to the underlying logic employed by the LSTM model in handling such a dataset.
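To make the question concrete, a mixed-frequency dataset of this kind might look like the following sketch (hypothetical column names, random values): a daily feature that is always observed, next to a monthly target that is NaN on every day except month end.

```python
import numpy as np
import pandas as pd

# daily dates for one quarter
dates = pd.date_range("2023-01-01", "2023-03-31", freq="D")
df = pd.DataFrame({
    "date": dates,
    "x_daily": np.random.rand(len(dates)),  # daily feature, always observed
    "y_monthly": np.nan,                    # monthly target, mostly missing
})

# the target is only observed at month end (Jan 31, Feb 28, Mar 31)
df.loc[df["date"].dt.is_month_end, "y_monthly"] = np.random.rand(3)

print(df["y_monthly"].notna().sum())  # 3 observed target values
```

So within each month there are ~29-30 days where only the features exist, which is the situation the question describes.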

dhopp1 commented 10 months ago

The data/observations the model is being trained on look like this:

y(jan 31st) = (x1(jan 31st) + x1(jan 30th) + ... + x1(jan 2nd)) + (x2(jan 31st) + ...)
y(feb 28th) = (x1(feb 28th) + x1(feb 27th) + ...) + (x2(feb 28th) + ...)
...

So the intervening days, say February 7th, 8th, etc., do not have observations for the target variable and thus are not observations the model is trained on. So no imputation is required. But of course those daily data are still being considered in the model, since the x1(feb 8th) etc. are inputs for the y(feb 28th) target observation.

Let me know if this is not clear.
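A minimal sketch of that training-set construction (my own illustration of the structure described above, not the library's internal code): keep only the rows where y is observed, and attach to each one the window of preceding daily feature values.

```python
import numpy as np
import pandas as pd

# daily feature series and a monthly target with NaNs in between
dates = pd.date_range("2023-01-01", "2023-03-31", freq="D")
df = pd.DataFrame({"date": dates, "x1": np.arange(len(dates), dtype=float)})
df["y"] = np.nan
df.loc[df["date"].dt.is_month_end, "y"] = [1.0, 2.0, 3.0]

n_timesteps = 30  # window of prior days fed to the model
samples = []
# df has a default RangeIndex, so index labels double as positions here
for i in df.index[df["y"].notna()]:
    window = df["x1"].iloc[max(0, i - n_timesteps + 1) : i + 1].to_numpy()
    samples.append((window, df.loc[i, "y"]))

# one training observation per observed y; the intervening
# days never need an imputed y, their features just enter the windows
print(len(samples))  # 3
```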

AZFARHAD24511 commented 9 months ago

> The data/observations the model is being trained on look like this:
>
> y(jan 31st) = (x1(jan 31st) + x1(jan 30th) + ... + x1(jan 2nd)) + (x2(jan 31st) + ...)
> y(feb 28th) = (x1(feb 28th) + x1(feb 27th) + ...) + (x2(feb 28th) + ...)
> ...
>
> So the intervening days, say February 7th, 8th, etc., do not have observations for the target variable and thus are not observations the model is trained on. So no imputation is required. But of course those daily data are still being considered in the model, since the x1(feb 8th) etc. are inputs for the y(feb 28th) target observation.
>
> Let me know if this is not clear.

The formula you've provided is a pure mathematical expression lacking specific parameters. It only holds true when all the parameters are identical and equal to 1.

dhopp1 commented 9 months ago

It is a simplified representation to illustrate the structure, not to be taken literally. The point is that the model is only trained on observations where a Y is present, therefore, there is no need to impute intervening Ys.

AZFARHAD24511 commented 9 months ago

Consider the scenario where we possess several variables (features) recorded daily from January 1, 2000, to the end of December 2023. Additionally, there is a single variable (target variable) recorded monthly over the same period. The objective is to employ an LSTM model to forecast the first two weeks of 2024. However, the challenge lies in the absence of daily data for the initial two weeks of 2024.

My specific inquiry is twofold: firstly, can the LSTM model be effectively utilized to predict the first two weeks of 2024 despite the unavailability of daily data for that period? Secondly, pertaining to the target variable, which is recorded on a monthly basis, is it feasible to estimate its daily values from January 1, 2000, to the end of December 2023?

I would greatly appreciate a comprehensive response to these queries. Thank you.

dhopp1 commented 9 months ago

@Diamong6688 Can you put this in a separate issue, so as not to confuse this one? Also, there isn't enough information to go off of; you'd have to provide more context, i.e. your actual code, not just the error message. That said, the line `Input and hidden tensors are not at the same device, found input tensor at cuda:0 and hidden tensor at cpu` suggests you might try not using the GPU and doing everything on the CPU first to make sure it works.

diamond-jlu commented 9 months ago

Dear Professor Daniel Hopp, I got an error when I ran your LSTM model. Despite repeated debugging, the model would not run on either CPU or GPU. Does your code require torch==1.8.1+cpu, torchvision==0.9.1+cpu, and torchaudio==0.8.1 to run? My version of PyTorch is 2.0.1. As you requested, I have created a new issue with the offending code and the error, and deleted my original comment. I would be very grateful if you could help me solve this problem!


dhopp1 commented 9 months ago

@AZFARHAD24511 The answer to both your questions is yes. Below is an example that does exactly what you are asking:

import numpy as np
import pandas as pd
from nowcast_lstm.LSTM import LSTM

# generating example data
dates = list(pd.date_range("2020-01-01", "2024-01-31", freq="d"))  # data from 2020 to 2024 just for the example to run faster
n_features = 3
data = pd.DataFrame(np.random.rand(len(dates), n_features))
data.columns = ["feature_" + str(i) for i in range(1, n_features+1)]
data["date"] = dates
data = data.loc[:, ["date"] + list(data.columns[:-1])]

target_variable = "feature_3"
data.loc[~data.date.dt.is_month_end, target_variable] = np.nan

# training data up to 2023
train = data.loc[data.date < "2024-01-01", :]

# test data
test = data.copy()
test.loc[test.date >= "2024-01-01", test.columns[1:]] = np.nan  # make sure no 2024 data is available

# training a model
model = LSTM(
    data = train,
    target_variable = target_variable,
    n_timesteps = 30, # training based on prior 30 days' worth of data
    n_models = 10,
    train_episodes = 50,
    batch_size = int(len(data)/4)
)
model.train()

# if you predict on the test data you can see daily predictions are available
model.predict(test)

Keep in mind the example uses random data, so the predictions won't be good; it is just for the sake of illustration. Running model.predict(test) you can see that daily predictions are now available, including through January 2024. Since this model is trained with a time lag of 30 days, the predictions through January 29th still have some published data to go off of, so they vary slightly. For January 30th and 31st, there is no more published data to make predictions off of (the inputs are all NAs), so the model can only predict the mean.

The prediction for January 15th, for example, takes into account data from December 17th to January 15th. There is data available for December 17th-31st, so that information is used to create the prediction, while the January 1st-15th data, which doesn't exist, goes in as the series mean.
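That mean-filling step can be sketched like this (my own illustration of the behaviour described above, not the package's internal code): unpublished values in the input window are replaced with the mean of the values the series does have.

```python
import numpy as np
import pandas as pd

# a daily feature whose last 15 days are not yet published
x = pd.Series(np.arange(30, dtype=float))
x.iloc[-15:] = np.nan

# fill the unpublished values with the mean of the observed ones
filled = x.fillna(x.mean())

print(filled.iloc[-1])  # 7.0, the mean of the 15 observed values 0..14
```

As the observed part of the window shrinks (e.g. the January 30th and 31st predictions above), more of the input is this constant mean, which is why those predictions collapse toward the model's average.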

Hopefully this example helps answer your question.