Marcnuth / AnomalyDetection

Twitter's Anomaly Detection in Pure Python
Apache License 2.0

Refactor loops to enable parallelization? #32

Open hokiegeek2 opened 4 years ago

hokiegeek2 commented 4 years ago

I am looking into parallelizing the section of detect_anoms where the majority of execution time is spent. It is the body of a for loop:

    if not one_tail:
        ares = abs(data - data.median())
    elif upper_tail:
        ares = data - data.median()
    else:
        ares = data.median() - data

    ares = ares / data.mad()

    tmp_anom_index = ares[ares.values == ares.max()].index
    cand = pd.Series(data.loc[tmp_anom_index], index=tmp_anom_index)

    data.drop(tmp_anom_index, inplace=True)

Is there a way to refactor the code so that the ordering the for loop enforces on the data.drop invocations is no longer needed?
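To make the dependency concrete, here is a minimal, self-contained sketch of the loop as I understand it. It is not the library's code: `remove_top_residuals` is a hypothetical helper, it covers only the two-tailed case, it computes a median absolute deviation by hand in place of `data.mad()`, and it removes one point per pass instead of every point tied for the maximum. The point is that each pass recomputes the median and the scale from whatever is left, so pass k cannot start before pass k-1 has dropped its point.

```python
from statistics import median


def remove_top_residuals(values, max_outliers):
    """Repeatedly remove the point with the largest scaled residual.

    Hypothetical helper for illustration only, not detect_anoms itself.
    """
    remaining = list(values)
    removed = []
    for _ in range(max_outliers):
        center = median(remaining)
        # Median absolute deviation computed by hand (assumption: this
        # stands in for data.mad() in the snippet above).
        mad = median([abs(v - center) for v in remaining])
        if mad == 0:
            break  # no spread left, nothing meaningful to rank
        # Two-tailed case only: scaled absolute residuals.
        ares = [abs(v - center) / mad for v in remaining]
        worst = ares.index(max(ares))
        # Dropping the point changes the median and MAD seen by the next
        # pass, which is what forces the passes to run in order.
        removed.append(remaining.pop(worst))
    return removed


removed = remove_top_residuals([1.0, 2.0, 3.0, 4.0, 100.0, -50.0], max_outliers=2)
```

In this toy run the median moves from 2.5 to 2 once 100.0 is gone, so the second pass is scored against different statistics than the first.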

I have a similar question about this loop:

for i in range(1, data.size + 1, num_obs_in_period):
    start_date = data.index[i]
    # if there is at least 14 days left, subset it, otherwise subset last_date - 14 days
    end_date = start_date + datetime.timedelta(days=num_days_in_period)
    if end_date < data.index[-1]:
        all_data.append(
            data.loc[lambda x: (x.index >= start_date) & (x.index <= end_date)])
    else:
        all_data.append(
            data.loc[lambda x: x.index >= data.index[-1] - datetime.timedelta(days=num_days_in_period)])
return all_data
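As far as I can tell, each pass of this second loop only reads from data and never modifies it, so the passes look independent of one another. Below is a rough sketch of that idea using only the standard library. It works on a plain list of timestamps and returns the (start, end) bounds of each chunk instead of the subsetted data. The names `window_for` and `split_bounds` are hypothetical, and starting at position 0 instead of 1 is an assumption on my part.

```python
import datetime
from concurrent.futures import ThreadPoolExecutor


def window_for(timestamps, i, num_days_in_period):
    """Return the (start, end) bounds of the chunk that starts at position i."""
    span = datetime.timedelta(days=num_days_in_period)
    start = timestamps[i]
    end = start + span
    if end < timestamps[-1]:
        return start, end
    # Not enough data left: fall back to the final `span` of the series.
    return timestamps[-1] - span, timestamps[-1]


def split_bounds(timestamps, num_obs_in_period, num_days_in_period):
    """Compute every chunk's bounds; no pass depends on another pass."""
    starts = range(0, len(timestamps), num_obs_in_period)
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, regardless of the
        # order in which the individual calls finish.
        return list(pool.map(
            lambda i: window_for(timestamps, i, num_days_in_period), starts))


# Toy data: 10 daily timestamps, chunks of 4 observations spanning 4 days.
days = [datetime.datetime(2020, 1, 1) + datetime.timedelta(days=d)
        for d in range(10)]
bounds = split_bounds(days, num_obs_in_period=4, num_days_in_period=4)
```

Threads are used here only to show that the result does not depend on execution order. Whether this is actually faster for CPU-bound work is a separate question.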

I am a software engineer, not a data scientist, so this may be a very naive question. :)

--John