InseadDataAnalytics / INSEADAnalytics


Holdouts with limited data points #93

Open samjameswaller opened 6 years ago

samjameswaller commented 6 years ago

Hi all - for Yahoo/Tumblr - if we only have ~40 data points, does it make sense to have a holdout / testing sample? My concern is if we hold out any data, we won't have enough to train the model properly. Thanks! Sam

VarunKShetty commented 6 years ago

Hi @samjameswaller

Two points here:

  1. Since this is a time-series model, we hold out only the latest observations, say the last 5 or the last 2, and NOT a randomly chosen set of observations. This follows from the inherent structure of time-series data: we assume that we know the past and want to predict the future.
  2. Since we have only 40 data points, the holdout sample has to be much smaller than usual. The cost is that, with so few holdout points, we cannot rely on the law of large numbers to average out the residuals, so the holdout error becomes noisy. For example, if you hold out only 3 data points and one of them is an outlier, you would end up with a large holdout error even if the model itself was a pretty good fit.
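
To make both points concrete, here is a minimal Python sketch. It is an illustration, not the course's code: the synthetic series, the helper names `chronological_split` and `rmse`, and the stand-in forecast are all assumptions. It holds out the latest 3 of 40 ordered observations, then shows how a single outlier in such a small holdout inflates the error even though the forecast is otherwise close.

```python
import math


def chronological_split(series, n_holdout):
    """Split an ordered series (oldest first) into (train, holdout).

    The holdout is always the latest n_holdout observations,
    never a random sample.
    """
    if not 0 < n_holdout < len(series):
        raise ValueError("n_holdout must be between 1 and len(series) - 1")
    return series[:-n_holdout], series[-n_holdout:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: 40 ordered observations on a simple upward trend.
series = [100.0 + 2.0 * t for t in range(40)]
train, holdout = chronological_split(series, n_holdout=3)

# Stand-in for a fitted model: a forecast that is off by 1 unit everywhere.
forecast = [value + 1.0 for value in holdout]
clean_error = rmse(holdout, forecast)

# Same forecast, but the last holdout observation is now an outlier (+30).
holdout_with_outlier = holdout[:-1] + [holdout[-1] + 30.0]
outlier_error = rmse(holdout_with_outlier, forecast)

print(f"train size: {len(train)}, holdout size: {len(holdout)}")
print(f"holdout RMSE without outlier: {clean_error:.2f}")
print(f"holdout RMSE with one outlier: {outlier_error:.2f}")
```

With only 3 holdout points, the one outlier dominates the RMSE; a larger holdout would dilute it, but would leave fewer points for training.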

tl;dr: There is no fool-proof answer to this. You will have to make a trade-off based on several considerations, a few of which I have highlighted above.

samjameswaller commented 6 years ago

Thanks Varun,

Much appreciated
