Closed YUKUN-XIAO closed 1 year ago
Hi Yukun, the reasoning for the clustering is that ML-based modeling uses featurized training data. The training data could be:
i). 1 time series at a time (Prophet or ARIMA for example). This means you might have to build 1 million specialized models, which can be time-consuming.
ii). All the training data at once (possible with deep learning or transformer algorithms). This means building 1 single generalized model to fit all the data at once. If the data has distinctly different segments with different behaviors, then the 1 model might lose accuracy on each of them. Conceptually, it's like fitting a mean curve through a bunch of data at once.
iii). A good compromise might be to choose reasonable clusters, then train a different model per cluster. So, instead of 1 million models or only 1 model, with clustering, you might get a handful of models.
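To make the difference between ii) and iii) concrete, here is a minimal sketch in plain Python. The data, the cluster labels, and `fit_mean_model` are all made up for illustration; the "model" is just an average standing in for whatever forecaster you would really train, so treat it as a toy and not as what the notebook does.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: item_id -> (cluster label, weekly demand history).
series = {
    "store_a": ("east", [10, 12, 11, 13]),
    "store_b": ("east", [9, 11, 12, 12]),
    "store_c": ("west", [100, 90, 110, 100]),
}


def fit_mean_model(histories):
    """Stand-in 'model' (an assumption): the average over all values in the group."""
    return mean(value for history in histories for value in history)


# Option ii): one global model fitted on all series at once.
global_model = fit_mean_model([history for _, history in series.values()])

# Option iii): group the series by cluster, then fit one model per cluster.
by_cluster = defaultdict(list)
for cluster, history in series.values():
    by_cluster[cluster].append(history)

cluster_models = {c: fit_mean_model(hs) for c, hs in by_cluster.items()}

print("global:", round(global_model, 2))
print("per cluster:", cluster_models)
```

In this toy case the global average lands between the two groups and describes neither of them well, while the per-cluster averages stay close to each group's own level. That is the "mean curve through a bunch of data" effect from ii).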
For retail data, the natural clustering is often by geography, since different products are sold in different geos, and this type of clustering is more likely to group together data with similar underlying patterns.
The taxi data seemed to contain both real data and what looked like too-smooth fake data (1 ride at regular time intervals). It had only 2 out of 4 possible clusters. For the demo, I kept only the “erratic” cluster, using the clustering technique explained here (scroll down to Step 21 or cell 117): https://github.com/aws-samples/amazon-forecast-samples/blob/ab6cf3c48fa1a22c892d997c3b7a9235a0f019c0/workshops/pre_POC_workshop/1.Getting_Data_Ready_nytaxi_weekly.ipynb
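As a rough illustration of separating "too-smooth" series from "erratic" ones, here is a minimal sketch. It is not the technique from the linked notebook: it simply assumes the coefficient of variation (standard deviation divided by mean) as the measure of variability, and the `threshold` value, function names, and sample data are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def label_series(values, threshold=0.1):
    """Label a series 'erratic' if its variability exceeds the assumed threshold."""
    return "erratic" if coefficient_of_variation(values) > threshold else "smooth"


# Hypothetical ride counts per location.
rides = {
    "loc_1": [1, 1, 1, 1, 1, 1],        # 1 ride at regular intervals
    "loc_2": [40, 75, 12, 90, 33, 61],  # irregular, real-looking demand
}

labels = {loc: label_series(values) for loc, values in rides.items()}

# Keep only the series labeled 'erratic' for training.
erratic_only = {loc: v for loc, v in rides.items() if labels[loc] == "erratic"}

print(labels)
```

A constant series has zero variability, so it falls in the "smooth" group whatever threshold you pick. Where to draw the line for less obvious series is a judgment call that depends on your data.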
Hope that helps with the motivation for clustering! Christy
Thank you so much!
May I ask, what is the basis for clustering the data into two groups? According to the code, the final output is the model that works best. Can we train a model for each location and export the model?