firmai / atspy

AtsPy: Automated Time Series Models in Python (by @firmai)
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3580631

Looping AtsPy over 15,000 zip codes from Zillow #15

Closed jtfields closed 4 years ago

jtfields commented 4 years ago

I'm working on a project to predict the top three zip codes in the US for increases in housing prices. I used AtsPy to predict the price for one zip code (53012) and now I want to loop over 15,000 zip codes. I'm looking for suggestions for how to do this most efficiently and ways to save the results for each loop. I know this is more of a "how to use" question than an issue. I searched Stack Overflow and since AtsPy is so new there are no posts related to it yet. Thanks again for a great new package for Python!

zillowByZip = zillowUSA1997to2020.loc[zillowUSA1997to2020['ZipCode']==53012]
zillowByZip = zillowByZip[['Value', 'Date']]
zillowByZip.Date = pd.to_datetime(zillowByZip.Date)
zillowByZip = zillowByZip.set_index("Date")
model_list=["Gluonts"]
am = AutomatedModel(df = zillowByZip, model_list=model_list, season="infer_from_data",forecast_len=60)
forecast_in, performance = am.forecast_insample()
forecast_out = am.forecast_outsample()
all_ensemble_in, all_ensemble_out, all_performance = am.ensemble(forecast_in, forecast_out)
forecast_out.head()
performance
all_performance
all_ensemble_in[["Target","Gluonts"]].plot()
all_ensemble_in
all_ensemble_out
all_ensemble_out[["Gluonts"]].plot()
am.models_dict_in
am.models_dict_out

firmai commented 4 years ago

Hi, because AtsPy is currently aimed at univariate forecasts, it is possible to do what you said, but the problem is that your multiple time series won't learn from each other when you are forecasting their future values. In which case you might have to go to GluonTS directly. See the solution here, https://github.com/awslabs/gluon-ts/issues/190.

If you do want to go ahead, an efficient approach is to parallelize the for loop; see https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop. To save the results after each iteration, you can pickle each result as it is produced, or collect the results in a dictionary and pickle that.

pickle.dump(dict_in, open("save.p", "wb"))
dict_out = pickle.load(open("save.p", "rb"))
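
To make the suggestion above concrete, here is a minimal sketch of running one forecast per zip code in a pool of workers, collecting the results in a dictionary, and pickling it. The function `forecast_one`, the sample data, and the file name `forecasts.p` are hypothetical placeholders, not part of AtsPy; the placeholder "model" simply repeats the last observed value.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def forecast_one(item):
    """Hypothetical stand-in for one AtsPy run on a single zip code."""
    zip_code, values = item
    # Placeholder "model": repeat the last observed value three times.
    return zip_code, [values[-1]] * 3


def forecast_all(series_by_zip, max_workers=4):
    """Run forecast_one over every zip code and collect results in a dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(forecast_one, series_by_zip.items()))


# Made-up sample data: zip code -> list of observed values.
series_by_zip = {
    "53012": [100.0, 101.5, 103.0],
    "21201": [80.0, 79.0, 81.0],
}
results = forecast_all(series_by_zip)

# Save the whole dictionary once the loop finishes, then read it back.
path = os.path.join(tempfile.gettempdir(), "forecasts.p")
with open(path, "wb") as fh:
    pickle.dump(results, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Threads are used here only to keep the sketch short. For CPU-bound model fitting, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is likely the better fit, provided the calling code sits under an `if __name__ == "__main__":` guard.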

jtfields commented 4 years ago

I used the .map function instead of a for loop since it was faster: subset["forecast"] = subset["RegionName"].map(Forecast)

It's working, but partway through the AtsPy AutomatedModel function it prints some of the plots and then stops with an "at least one array or dtype is required" error. Any suggestions on what is causing this error?

ValueError                                Traceback (most recent call last)

in ()
----> 1 subset["forecast"] = subset["RegionName"].map(Forecast)

7 frames

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    473
    474         if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
--> 475             dtype_orig = np.result_type(*dtypes_orig)
    476
    477     if dtype_numeric:

<__array_function__ internals> in result_type(*args, **kwargs)

ValueError: at least one array or dtype is required
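
The last frame of this traceback can be reproduced in isolation. `np.result_type` raises this error when it is called with no arguments, which is what `np.result_type(*dtypes_orig)` amounts to when `dtypes_orig` is empty, for example for a DataFrame that has no columns. The sketch below only demonstrates that mechanism; the helper `result_dtype_of` is hypothetical, and nothing here shows where inside AtsPy such a frame would come from.

```python
import numpy as np
import pandas as pd


def result_dtype_of(frame):
    """Combine the dtypes of all columns, as the failing line does."""
    dtypes = list(frame.dtypes)
    return np.result_type(*dtypes)


# A frame with one float column works as expected.
ok = pd.DataFrame({"Value": [1.0, 2.0]})
print(result_dtype_of(ok))  # float64

# A frame with rows but no columns has no dtypes to combine.
empty = pd.DataFrame(index=[0, 1])
try:
    result_dtype_of(empty)
except ValueError as exc:
    print("ValueError:", exc)
```

If this is the mechanism at work, printing the shape and columns of the data for the failing zip code just before the model call should show whether the frame has lost its columns.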

jtfields commented 4 years ago

Here is the code...

def Forecast(Zip):
    zillowByZip = zillowUSA97to17.loc[zillowUSA97to17['RegionName']==Zip]
    if len(zillowByZip) < 3:
        return None
    elif len(zillowByZip) > 3:
        zillowByZip = zillowByZip[['Value', 'Date']]
        zillowByZip.Date = pd.to_datetime(zillowByZip.Date)
        zillowByZip = zillowByZip.set_index("Date")
        model_list=["Prophet"]
        am = AutomatedModel(df = zillowByZip, model_list=model_list, season="infer_from_data",forecast_len=60)
        forecast_in, performance = am.forecast_insample()
        forecast_out = am.forecast_outsample()
        all_ensemble_in, all_ensemble_out, all_performance = am.ensemble(forecast_in, forecast_out)
        forecast_out.head()
        performance
        all_performance
        all_ensemble_in[["Target","Prophet"]].plot()
        all_ensemble_in
        all_ensemble_out
        all_ensemble_out[["Prophet"]].plot()
        am.models_dict_in
        am.models_dict_out

subset = zillowUSA97to17.loc[zillowUSA97to17['State']=='MD']

subset["forecast"] = subset["RegionName"].map(Forecast)

jtfields commented 4 years ago

I did some more tests and the state of Maryland errors out after 5 zip codes. The state of Maine errors out after about 25 zip codes. It seems like there is some value in AtsPy that I need to reset prior to each loop. Can anyone provide some guidance on this?
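
One way to narrow this down is to let the loop continue past a failing zip code and record which ones fail, so they can be inspected individually. The sketch below also addresses two details of the Forecast function as posted: its forecasting branch has no return statement, so .map would store None for every zip code, and a zip code with exactly 3 rows matches neither condition. The function `fit_and_forecast` is a hypothetical stand-in for the AutomatedModel calls, and the sample data is made up; this does not diagnose the error itself.

```python
import pandas as pd


def fit_and_forecast(series, horizon=3):
    """Hypothetical stand-in for the AutomatedModel calls in the post."""
    if series.isna().any():
        raise ValueError("series contains missing values")
    # Placeholder "model": repeat the last observed value.
    return [float(series.iloc[-1])] * horizon


def forecast_zip(frame, zip_code, min_rows=3):
    """Return a forecast for one zip code, or None if there is too little data."""
    rows = frame.loc[frame["RegionName"] == zip_code, ["Date", "Value"]].copy()
    if len(rows) < min_rows:
        return None
    rows["Date"] = pd.to_datetime(rows["Date"])
    series = rows.set_index("Date")["Value"].sort_index()
    return fit_and_forecast(series)


def forecast_many(frame):
    """Forecast every zip code, recording failures instead of stopping."""
    results, failures = {}, {}
    for zip_code in frame["RegionName"].unique():
        try:
            results[zip_code] = forecast_zip(frame, zip_code)
        except Exception as exc:
            # Keep the error text so the failing zip codes can be examined later.
            failures[zip_code] = repr(exc)
    return results, failures


# Made-up sample: one clean zip code, one with a gap, one with too few rows.
data = pd.DataFrame({
    "RegionName": ["21201"] * 3 + ["21202"] * 3 + ["21203"],
    "Date": ["2020-01-01", "2020-02-01", "2020-03-01"] * 2 + ["2020-01-01"],
    "Value": [1.0, 2.0, 3.0, 4.0, None, 6.0, 7.0],
})
results, failures = forecast_many(data)
```

Looking at the raw rows for the zip codes that end up in `failures` (missing values, duplicate dates, very short histories) may show what they have in common.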