alkaline-ml / pmdarima

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.
https://www.alkaline-ml.com/pmdarima
MIT License

How can I reduce memory usage when fitting auto_arima? #346

Open eromoe opened 4 years ago

eromoe commented 4 years ago

Hello,

I am trying to use auto_arima to search for the best model over 20 thousand time series (no exogenous variables). This is quite a bit smaller than another dataset (nearly a million series) that I trained with fbprophet (with some exogenous variables).

But to my surprise, Spark throws a memory error, even though I have given each executor core over 2 GB of memory. The training code is very simple:

```python
pipeline = Pipeline([
    ("boxcox", BoxCoxEndogTransformer()),
    ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                           start_P=0, seasonal=True, d=1, D=1, trace=True,
                           error_action='ignore', stepwise=True,
                           suppress_warnings=True))
])
pipeline.fit(X['y'].to_numpy() + 1)
```

Is there some setting I missed that can reduce the memory usage during training?

tgsmith61591 commented 4 years ago

When you say 20 thousand timeseries do you mean 20k samples? Can you please provide a bit more information, like how you're triggering these model fits on Spark executors, and what the stacktrace looks like?

eromoe commented 4 years ago

They are real data; the lengths vary from 1 to 400 (most are 400).

The main training steps, for example:


```python
def train_model(d):
    X = d['data']

    pipeline = Pipeline([
        ("boxcox", BoxCoxEndogTransformer()),
        ("model", pm.AutoARIMA(start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                               start_P=0, seasonal=True, d=1, D=1, trace=True,
                               error_action='ignore', stepwise=True,
                               suppress_warnings=True))
    ])
    pipeline.fit(X['y'].to_numpy() + 1)

    d['model'] = pipeline
    return d
```

```python
def pickle_model(d):
    model = d['model']
    d['model_pickled'] = bytearray(dill.dumps(model))

    return {
        "store_id": d['store_id'],
        "product_id": d['product_id'],
        "model_pickled": d['model_pickled'],
        "train_days": d.get('train_days'),
    }
```
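(As a storage-side mitigation, independent of pmdarima itself: fitted models often serialize to repetitive byte streams, so compressing the dill/pickle payload before writing it to parquet can shrink the stored column considerably. A minimal standard-library sketch, using `pickle` and `zlib` as stand-ins for the `dill` call above:)

```python
import pickle
import zlib


def pickle_compressed(obj, level=6):
    """Serialize an object, then zlib-compress the bytes for storage."""
    return bytearray(zlib.compress(pickle.dumps(obj), level))


def unpickle_compressed(payload):
    """Inverse of pickle_compressed: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(bytes(payload)))


# Stand-in for a fitted model: large float arrays compress fairly well.
model = {"params": [float(i) for i in range(50_000)]}

raw = pickle.dumps(model)
packed = pickle_compressed(model)
print(len(raw), len(packed))  # the compressed payload is smaller
```

The trade-off is a small CPU cost at read time, which is usually negligible next to the model fit itself.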

The RDD map:

```python
df = load_data(spark, ...)

df1 = (df.rdd
       .map(lambda r: r.asDict())
       .map(lambda d: transform_data(d))
       .filter(lambda d: len(d['data']) > min_train_length)
       .map(lambda d: train_model(d))
       .map(lambda d: pickle_model(d)))

schema = StructType(
    [StructField(i, StringType(), True) for i in group_cols]
    + [StructField('model_pickled', BinaryType(), True)]
)
df2 = spark.createDataFrame(df1, schema)

df2.write.parquet(output_path, mode='overwrite')
```

fbprophet keeps some redundant attributes such as model.history; setting model.history = -1 before pickling reduces the stored size considerably. So I wonder if there is something similar in pmdarima.
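(The fbprophet trick above is an instance of a general pattern: drop heavy, training-only attributes from a copy of the fitted object before pickling. Whether pmdarima retains such attributes, and under what names, is exactly what this issue asks, so the class and attribute names below are purely hypothetical stand-ins, not pmdarima internals:)

```python
import copy
import pickle


class FittedModel:
    """Stand-in estimator; 'history' mimics a heavy training-only cache."""
    def __init__(self):
        self.coef = [0.5, -0.2, 0.1]                        # small: needed to predict
        self.history = [float(i) for i in range(100_000)]   # large: only used in fit


def slim_pickle(model, heavy_attrs=("history",)):
    """Pickle a shallow copy of the model with heavy attributes blanked out."""
    slim = copy.copy(model)           # shallow copy: the original stays intact
    for name in heavy_attrs:
        if hasattr(slim, name):
            setattr(slim, name, None)
    return pickle.dumps(slim)


m = FittedModel()
full = pickle.dumps(m)
slim = slim_pickle(m)
print(len(full), len(slim))  # the slimmed payload is smaller
```

The restored object can still predict (its small parameters survive), but anything that depended on the dropped attribute is gone, so this only works for attributes that are genuinely unused after fitting.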

tgsmith61591 commented 4 years ago

I think some of the recent changes in #359 and #361 might help with this. Hoping to get v1.7.0 out in the near future.

eromoe commented 4 years ago

n_samples: 933

```python
>>> ms = pickle.dumps(m)
>>> len(ms) / 1024 ** 2
96.15693759918213
```
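(To see where those megabytes actually live, one rough diagnostic, plain Python rather than any pmdarima API, is to pickle each attribute of the fitted object separately and rank them by size. Shown here on a stand-in object; on a real fit you would pass the model or its underlying results wrapper instead:)

```python
import pickle


def pickle_footprint(obj, top=5):
    """Rank an object's attributes by their individual pickled size in bytes."""
    sizes = {}
    for name, value in vars(obj).items():
        try:
            sizes[name] = len(pickle.dumps(value))
        except Exception:
            sizes[name] = -1  # attribute is not picklable on its own
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]


class Dummy:
    """Stand-in for a fitted model with one oversized attribute."""
    def __init__(self):
        self.big = list(range(10_000))
        self.small = "arima"


for name, size in pickle_footprint(Dummy()):
    print(name, size)  # 'big' dominates the footprint
```

This at least tells you which attribute to raise in an issue, or to blank out before serializing, instead of guessing.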

Just for the record. I haven't tested the new version yet.