MartinBernstorff closed this issue 6 months ago
I'm using the newest version, but the BooleanOutcomeSpec shows some buggy behaviour. I'm a bit busy tomorrow, but I'll test more thoroughly and file an issue if the problem persists. Hopefully I'll get around to it by the end of tomorrow. Super grateful for such a fast response to the problem - thanks!
Hey again, I finally got around to looking at this bug and diagnosing the problem. The problem persists on our server at Rigshospitalet, but I managed to fix it locally and also to reproduce it.
For both PredictorSpec and OutcomeSpec, the horizontal concatenation throws an error saying that "Series has no attribute drop". Most likely this is because it iterates over the individual dataframe, which turns the object "df" into a Series rather than a DataFrame.
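For illustration, here is a minimal sketch (not timeseriesflattener's actual internals, just a stand-in for the behaviour I suspect) of how descending into a polars DataFrame during iteration/flattening yields Series, which then lack DataFrame-only methods such as .drop:

import polars as pl

df = pl.DataFrame({"id": [1, 2], "value": [3, 4]})

# A naive flatten helper (purely illustrative) that descends into every
# iterable it encounters, including DataFrames.
def naive_flatten(items):
    for item in items:
        if hasattr(item, "__iter__") and not isinstance(item, str):
            yield from item
        else:
            yield item

# Iterating a DataFrame yields its columns as Series, so the later
# horizontal concatenation receives Series instead of DataFrames ...
pieces = list(naive_flatten([df]))
print(type(pieces[0]))  # polars Series, not a DataFrame

# ... and a DataFrame-only call then fails, e.g.:
# pieces[0].drop("id")  # AttributeError: 'Series' object has no attribute 'drop'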
This problem occurs with:
But is fixed with:
import datetime as dt
import numpy as np
import polars as pl
import pandas as pd

# Load a dataframe with times you wish to make a prediction
prediction_times_df = pl.DataFrame(
    {
        "id": [1, 1, 2],
        "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-02-01"]),
    }
)

# Load a dataframe with raw values you wish to aggregate as predictors
predictor_df = pl.DataFrame(
    {
        "id": [1, 1, 1, 2],
        "date": pd.to_datetime(
            ["2020-01-15", "2019-12-10", "2019-12-15", "2020-01-02"]
        ),
        "value": [1, 2, 3, 4],
    }
)

# Load a dataframe specifying when the outcome occurs
outcome_df = pl.DataFrame(
    {"id": [1], "date": pd.to_datetime(["2020-03-01"]), "value_outcome": [1]}
)

# Specify how to aggregate the predictors and define the outcome
from timeseriesflattener import (
    MaxAggregator,
    MinAggregator,
    OutcomeSpec,
    PredictionTimeFrame,
    PredictorSpec,
    ValueFrame,
)

predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        init_df=predictor_df.lazy(),
        entity_id_col_name="id",
        value_timestamp_col_name="date",
    ),
    lookbehind_distances=[dt.timedelta(days=1)],
    aggregators=[MaxAggregator()],
    fallback=np.nan,
    column_prefix="pred",
)

outcome_spec = OutcomeSpec(
    value_frame=ValueFrame(
        init_df=outcome_df.lazy(),
        entity_id_col_name="id",
        value_timestamp_col_name="date",
    ),
    lookahead_distances=[dt.timedelta(days=1)],
    aggregators=[MaxAggregator()],
    fallback=np.nan,
    column_prefix="outc",
)

# Instantiate the Flattener and add the specifications
from timeseriesflattener import Flattener

result = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=prediction_times_df.lazy(),
        entity_id_col_name="id",
        timestamp_col_name="date",
    )
).aggregate_timeseries(specs=[predictor_spec, outcome_spec])
result.collect()
I'll try to update everything on the server, which should fix everything. Just as a note: I get an error when running the example on the front page on GitHub due to conflicting specs from the outcome and predictor specs (both have a column called "value", which makes them conflict). I'll implement a rename on my end, roughly as sketched below, and see if that fixes the problem on the server.
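For reference, a minimal sketch of the workaround I mean, assuming both frames in the front-page example carry a column named "value"; renaming one of them (as the repro above does with "value_outcome") avoids the clash:

import polars as pl

# Hypothetical stand-ins for the front-page example's frames; both carry a
# column named "value", which is what makes the specs conflict.
predictor_df = pl.DataFrame({"id": [1], "date": ["2020-01-15"], "value": [1]})
outcome_df = pl.DataFrame({"id": [1], "date": ["2020-03-01"], "value": [1]})

# Renaming the outcome's value column keeps the two specs from producing
# clashing output columns.
outcome_df = outcome_df.rename({"value": "value_outcome"})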
All the very, very best, Mikkel
Ah yeah, sorry to hear it! We actually fixed this problem locally as well; you'll find that the iterpy dependency has been pinned on main to avoid it. I'm pretty sure just changing iterpy should fix it.
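If it helps, a quick diagnostic sketch for checking which iterpy version is installed on the server (the exact version pinned on main isn't stated here, so compare against the pin in the repo's dependencies):

from importlib.metadata import version

# Compare this against the iterpy version pinned in timeseriesflattener's
# dependencies on main.
print(version("iterpy"))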
Excellent point re: the example, we'll take a look! Let me know if this is fixed.
It freaking works with iterpy being downgraded, hallelujah! Thank you so much!
All the best, Mikkel
Excellent! Closing.
Ask Mikkel Werling for details. mikkel.werling@regionh.dk