christy / AnyscaleDemos

Apache License 2.0

code question #114

Closed YUKUN-XIAO closed 1 year ago

YUKUN-XIAO commented 1 year ago

May I ask what the basis is for clustering the data into two groups? From the code, it looks like the final output is the single best-performing model. Can we instead train a model for each location and export those models?

christy commented 1 year ago

Hi Yukun, the reasoning for the clustering is that ML-based modeling uses featurized training data. The training data could be:

i). 1 time series at a time (Prophet or ARIMA for example). This means you might have to build 1 million specialized models, which can be time-consuming.

ii). All the training data at once (possible with deep learning or transformer algorithms). This means building a single generalized model to fit all the data at once. If the data has distinctly different segments with different behaviors, then the single model might lose accuracy on some of them. Conceptually, it's like fitting a mean curve through a bunch of data at once.

iii). A good compromise might be to choose reasonable clusters, then train a different model per cluster. So, instead of 1 million models or only 1 model, with clustering you might get a handful of models.
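To make option iii) concrete, here is a minimal sketch, not the demo's actual code. It assumes each location already has a cluster label, and it stands in a pooled least-squares trend line for the real forecasting model. The names `fit_linear_trend`, `train_per_cluster`, `series_by_location`, and `cluster_of` are hypothetical, and the data is made up.

```python
from collections import defaultdict
from statistics import mean


def fit_linear_trend(points):
    """Least-squares fit of y = intercept + slope * t over pooled (t, y) pairs.

    This is only a placeholder for a real forecasting model.
    """
    ts = [t for t, _ in points]
    ys = [y for _, y in points]
    t_bar, y_bar = mean(ts), mean(ys)
    denom = sum((t - t_bar) ** 2 for t in ts)
    slope = (
        sum((t - t_bar) * (y - y_bar) for t, y in points) / denom if denom else 0.0
    )
    return y_bar - slope * t_bar, slope


def train_per_cluster(series_by_location, cluster_of):
    """Pool every series that shares a cluster label and fit one model per label."""
    pooled = defaultdict(list)
    for location, values in series_by_location.items():
        # enumerate() supplies the time index t = 0, 1, 2, ...
        pooled[cluster_of[location]].extend(enumerate(values))
    return {label: fit_linear_trend(points) for label, points in pooled.items()}


# Hypothetical toy data: two locations trend upward, one trends downward.
series_by_location = {
    "loc_a": [10, 12, 14, 16],
    "loc_b": [11, 13, 15, 17],
    "loc_c": [50, 45, 40, 35],
}
cluster_of = {"loc_a": 0, "loc_b": 0, "loc_c": 1}

models = train_per_cluster(series_by_location, cluster_of)
```

Passing a mapping that gives every location its own label (for example `{loc: loc for loc in series_by_location}`) reduces this to option i), one model per location, which is possible but scales with the number of locations.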

For retail data, the natural clustering is often geographic, since different products are sold in different geos, and this type of clustering is more likely to group together data with similar underlying patterns.

The taxi data seemed to contain both real data and what looked like too-smooth fake data (1 ride at regular time intervals). The taxi data had only 2 out of 4 possible clusters. For the demo, I kept only the “erratic” cluster, using the clustering technique explained here (scroll down to Step 21 or cell 117): https://github.com/aws-samples/amazon-forecast-samples/blob/ab6cf3c48fa1a22c892d997c3b7a9235a0f019c0/workshops/pre_POC_workshop/1.Getting_Data_Ready_nytaxi_weekly.ipynb
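As a rough illustration of what a 4-way split with an “erratic” group can look like, here is a sketch of the common demand-classification scheme based on average demand interval (ADI) and squared coefficient of variation (CV²). It is an assumption that this matches the linked notebook's technique. The cutoffs 1.32 and 0.49 are the conventional ones for this scheme, and `classify_demand` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# Conventional cutoffs for this classification scheme
# (assumed here, not taken from the linked notebook).
ADI_CUTOFF = 1.32
CV2_CUTOFF = 0.49


def classify_demand(values):
    """Label one series as smooth, erratic, intermittent, or lumpy."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return "no demand"
    # ADI: average number of periods per period with non-zero demand.
    adi = len(values) / len(nonzero)
    # CV^2: squared coefficient of variation of the non-zero demand sizes.
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2
    if adi < ADI_CUTOFF:
        return "smooth" if cv2 < CV2_CUTOFF else "erratic"
    return "intermittent" if cv2 < CV2_CUTOFF else "lumpy"


# Hypothetical series: a perfectly regular one and a highly variable one.
regular = classify_demand([1, 1, 1, 1, 1, 1])
variable = classify_demand([1, 20, 2, 35, 1, 50])
```

Under these assumptions, a series with 1 ride at every interval lands in the “smooth” group, while a series with demand in every period but widely varying counts lands in the “erratic” group.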

Hope that helps with the motivation for clustering! Christy

YUKUN-XIAO commented 1 year ago

Thank you so much!