fchollet / deep-learning-with-python-notebooks

Jupyter notebooks for the code samples of the book "Deep Learning with Python"
MIT License
17.95k stars 8.48k forks

Chapter 10: Discrepancy between problem statement and Keras implementation in timeseries_dataset_from_array() #238

Open juandevprojects opened 3 months ago

juandevprojects commented 3 months ago

Description: While reading chapter 10, "Deep learning for timeseries", I noticed what looks like a discrepancy between the problem statement and the actual implementation.

Problem Statement: The problem statement, as described in section 10.2.1, outlines a scenario where temperature data and other variables for 5 days, sampled once per hour, are provided. The objective is to predict the temperature 24 hours ahead.

Concern: According to the problem statement, there are 120 samples in 5 days (24 samples per day). The dataset should consist of sequences representing 5 days of data, with each sequence containing a maximum of 120 samples.

Keras Implementation: However, when utilizing the timeseries_dataset_from_array() function with parameters sampling_rate = 6 and sequence_length = 120, it generates sequences corresponding to 30 days (4 samples per day). This seems to deviate from the problem statement's objective of predicting temperature with data from 5 days, not 30.

Proposed Solution: One potential solution could be adjusting the sequence_length parameter to 20. This adjustment would ensure that sequences contain data from 5 consecutive days (4 samples per day using sampling_rate = 6), aligning with the problem statement's requirements.

Request for Clarification: I'd appreciate clarification on whether my analysis is accurate and if the implementation aligns with the intended problem statement. If not, guidance on how to correctly utilize the timeseries_dataset_from_array() function for the specified problem would be valuable.

Thank you for your attention to this matter.

shenchenbing commented 3 months ago

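The original data is recorded every 10 minutes, so it contains 6 records per hour, not 1. sampling_rate=6 therefore keeps 1 record per hour (24 per day), and sequence_length=120 covers 120 hours, which is exactly 5 days. The book's description is correct.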
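
To make the arithmetic in this reply concrete, here is a minimal sketch in plain Python. It does not call Keras; it only mirrors the index arithmetic of a strided window. The helper names (`samples_per_day`, `days_covered`, `window_indices`) are hypothetical, the 10-minute recording interval is taken from the reply above, and the `delay` formula is written as I recall it from the chapter's notebook, so treat it as an assumption to check against the notebook itself.

```python
# Assumption: the raw data has one record every 10 minutes (6 per hour).
RAW_INTERVAL_MINUTES = 10


def samples_per_day(sampling_rate, raw_interval_minutes=RAW_INTERVAL_MINUTES):
    """Number of samples kept per day when every `sampling_rate`-th record is used."""
    minutes_between_samples = sampling_rate * raw_interval_minutes
    return 24 * 60 / minutes_between_samples


def days_covered(sampling_rate, sequence_length,
                 raw_interval_minutes=RAW_INTERVAL_MINUTES):
    """Days of data represented by one sequence of `sequence_length` samples."""
    return sequence_length / samples_per_day(sampling_rate, raw_interval_minutes)


def window_indices(start, sampling_rate, sequence_length):
    """Raw-row indices that a strided window starting at `start` would pick."""
    return [start + k * sampling_rate for k in range(sequence_length)]


# Settings discussed in the issue.
sampling_rate = 6
sequence_length = 120

print(samples_per_day(sampling_rate))                # 24.0 samples per day
print(days_covered(sampling_rate, sequence_length))  # 5.0 days
print(days_covered(sampling_rate, 20))               # the proposed change: under 1 day

# Assumed definition of the target offset, as recalled from the notebook.
delay = sampling_rate * (sequence_length + 24 - 1)
last_index = window_indices(0, sampling_rate, sequence_length)[-1]
hours_ahead = (delay - last_index) * RAW_INTERVAL_MINUTES / 60
print(hours_ahead)                                   # 24.0 hours after the last sample
```

Under these assumptions, changing sequence_length to 20 would shrink each sequence to 20 hours of data rather than 5 days, so the original settings are the ones that match the problem statement.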