DARPA-ASKEM / askem-beaker

Integrated AI-enabled coding notebooks for Terarium
https://darpa-askem.github.io/askem-beaker/
MIT License

[BUG]: Incidence-to-Prevalence instruction set needs a bit of refinement #168

Open liunelson opened 4 days ago

liunelson commented 4 days ago

There are some minor bugs in the code generated from the incidence-to-prevalence prompt.

I had to edit the prompt and code a bit to get the right behaviour:

I have three dataframes d1, d2, d3. Do not edit them. Create new dataframes. d1 is incident case counts. d2 is incident hospitalization counts. d3 is cumulative death counts. Let's assume that average time to recover is 14 days and average time to exit the hospital is 10 days. Can you convert this data into prevalence data? Ideally please map it to SIRHD. Assume a population of 320 million. Make sure that every value of I, R, H, D, S is at least 0.0, not negative.

The code is:

import pandas as pd

# Assuming d1 is the dataset for new cases (I), d2 for hospitalizations (H), and d3 for deaths (D)
# Convert 'date' columns in all dataframes to datetime type
d1['date'] = pd.to_datetime(d1['date'], format='%Y-%m-%d')
d2['date'] = pd.to_datetime(d2['date'], format='%Y-%m-%d')
d3['date'] = pd.to_datetime(d3['date'], format='%Y-%m-%d')

# Sort dataframes by date
d1 = d1.sort_values('date')
d2 = d2.sort_values('date')
d3 = d3.sort_values('date')

# Calculate active cases (I) - Sum of new cases over the last 14 days
I = d1['value'].rolling(window=14, min_periods=1).sum()

# Calculate hospitalizations (H) - Sum of new hospitalizations over the last 10 days
H = d2['value'].rolling(window=10, min_periods=1).sum()

# Deaths (D) are cumulative, so we can use the value from d3 directly
D = d3['value']

# Calculate recoveries (R) - Cumulative sum of cases up to (current date - 14 days) - current deaths
R = d1['value'].cumsum().shift(periods=14, fill_value=0.0) - D

# Create a new dataframe for I, R, H, D
df_SIRHD = pd.DataFrame({
    'date': d1['date'],
    'I': I,
    'R': R,
    'H': H,
    'D': D
})

# Calculate the susceptible population (S)
total_population = 320e6  # Total population
df_SIRHD['S'] = total_population - df_SIRHD[['I', 'R', 'H', 'D']].sum(axis=1)

# Replace negative values with 0.0 in specified columns
df_SIRHD[['I', 'R', 'H', 'D', 'S']] = df_SIRHD[['I', 'R', 'H', 'D', 'S']].clip(lower=0)

# Remove rows where the 'date' column is NaT
df_SIRHD.dropna(subset=['date'], inplace=True)
df_SIRHD
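
As a quick sanity check (a hypothetical follow-up, not part of the generated code), the compartments can be inspected to confirm they are non-negative and that S + I + R + H + D stays near the assumed population wherever nothing was clipped:

# Hypothetical sanity check: after clipping, all compartments should be
# non-negative, and S + I + R + H + D should stay close to 320e6 on any
# date where no value actually needed clipping.
print((df_SIRHD[['S', 'I', 'R', 'H', 'D']] >= 0).all().all())
print(df_SIRHD[['S', 'I', 'R', 'H', 'D']].sum(axis=1).describe())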

The particular edits are:

  1. Tell the agent to avoid editing the input variables (to work around the "Reset kernel" and "Rerun all cells" issue)
  2. Remove any code that changes the index to the date
  3. Add min_periods=1 to the rolling sums to avoid NaN values showing up in the I and H variables (see the sketch after this list)
  4. Add fill_value=0.0 to the shifted cumulative sum for R
  5. Clip negative values to 0.0 and remove rows with a NaT date
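
For reference, a minimal sketch (toy data, not from the issue) of the pandas idioms behind edits 3-5:

import pandas as pd

# Toy incident counts over five days (hypothetical values, for illustration only)
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Without min_periods, the first window - 1 entries of a rolling sum are NaN;
# min_periods=1 turns them into partial sums instead.
print(toy.rolling(window=3).sum().tolist())                 # [nan, nan, 60.0, 90.0, 120.0]
print(toy.rolling(window=3, min_periods=1).sum().tolist())  # [10.0, 30.0, 60.0, 90.0, 120.0]

# Without fill_value, shift() introduces NaN at the start of the series;
# fill_value=0.0 starts the shifted cumulative sum at zero instead.
print(toy.cumsum().shift(periods=2).tolist())                 # [nan, nan, 10.0, 30.0, 60.0]
print(toy.cumsum().shift(periods=2, fill_value=0.0).tolist()) # [0.0, 0.0, 10.0, 30.0, 60.0]

# clip(lower=0) replaces any negative values with 0.0
print(pd.Series([-5.0, 3.0]).clip(lower=0).tolist())  # [0.0, 3.0]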
liunelson commented 4 days ago

Unfortunately, re-running the prompt above gave me code with incorrect logic (it appears to treat the daily incident counts as if they were cumulative values, differencing and shifting them instead of taking rolling and cumulative sums):

import pandas as pd

# Convert 'date' columns in all dataframes to datetime for proper manipulation
d1['date'] = pd.to_datetime(d1['date'])
d2['date'] = pd.to_datetime(d2['date'])
d3['date'] = pd.to_datetime(d3['date'])

# Sort dataframes by date to ensure correct rolling calculations
d1 = d1.sort_values('date')
d2 = d2.sort_values('date')
d3 = d3.sort_values('date')

# Calculate 'Infected' (I) category
d1['I'] = d1['value'] - d1['value'].shift(14)
d1['I'] = d1['I'].fillna(0).apply(lambda x: max(x, 0.0))

# Calculate 'Recovered' (R) category
d1['R'] = d1['value'].shift(14)
d1['R'] = d1['R'].fillna(0).apply(lambda x: max(x, 0.0))

# Calculate 'Hospitalized' (H) category
d2['H'] = d2['value'] - d2['value'].shift(10)
d2['H'] = d2['H'].fillna(0).apply(lambda x: max(x, 0.0))

# 'Dead' (D) category directly comes from d3
d3['D'] = d3['value'].apply(lambda x: max(x, 0.0))

# Calculate 'Susceptible' (S) category
# First, ensure we have a common date range for all calculations
common_dates = set(d1['date']) & set(d2['date']) & set(d3['date'])
df_I = d1[d1['date'].isin(common_dates)][['date', 'I']]
df_R = d1[d1['date'].isin(common_dates)][['date', 'R']]
df_H = d2[d2['date'].isin(common_dates)][['date', 'H']]
df_D = d3[d3['date'].isin(common_dates)][['date', 'D']]

# Merge dataframes on 'date' to align all calculations
df_merged = pd.merge(pd.merge(pd.merge(df_I, df_R, on='date'), df_H, on='date'), df_D, on='date')

# Calculate 'Susceptible' (S)
total_population = 320e6
df_merged['S'] = total_population - df_merged['I'] - df_merged['R'] - df_merged['H'] - df_merged['D']
df_merged['S'] = df_merged['S'].apply(lambda x: max(x, 0.0))

# Selecting the final dataframe to display the compartmentalized data
df_final = df_merged[['date', 'S', 'I', 'R', 'H', 'D']]

# Display the head of the final dataframe to verify the calculations
print(df_final.head())
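
To make the problem concrete, a minimal sketch (toy data, not from the issue) contrasting the rolling-sum approach with the differencing the rerun produced:

import pandas as pd

# Toy daily incident counts: a steady 5 new cases per day (hypothetical)
new_cases = pd.Series([5.0] * 6)

# Correct active-case estimate: sum of new cases over the trailing window
print(new_cases.rolling(window=3, min_periods=1).sum().tolist())
# [5.0, 10.0, 15.0, 15.0, 15.0, 15.0]

# The rerun's approach differences values 3 days apart, which is only
# meaningful for a cumulative series; on incident counts it collapses to zero.
print((new_cases - new_cases.shift(3)).fillna(0).tolist())
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]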
liunelson commented 4 days ago

A picture = a thousand words

The left operator shows the correct result, while the right one shows results calculated using the code from re-running the prompt above. [screenshot]

brandomr commented 4 days ago

@liunelson looking into these. In the first comment, is that code correct? If not, do you have the corrected code by any chance?

brandomr commented 4 days ago

@liunelson are you testing with these datasets:

{"d1": "2fb95ea8-8e71-47c6-b6a9-9c2d6d31484e",
 "d2": "06a36516-f4ad-4233-a5fe-dd4a2e3343cd",
 "d3": "561b00ad-331f-4d0c-8eb2-0dd813290eef"}

The issue is that they have a bunch of locations, so they need to be filtered first (e.g. to just United States). In the plot on the right in your screenshot, I think what we're seeing is a whole bunch of values for the same date (due to multiple locations on that date). Still poking at it to confirm, though.
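
For example, a minimal sketch of that filtering step (the 'location' column name and value are assumptions about the dataset schema):

# Assumed schema: each dataframe carries a 'location' column alongside
# 'date' and 'value'. Filtering to one location before the conversion
# avoids mixing multiple values per date.
d1 = d1[d1['location'] == 'United States'].copy()
d2 = d2[d2['location'] == 'United States'].copy()
d3 = d3[d3['location'] == 'United States'].copy()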

brandomr commented 4 days ago

@liunelson testing with those datasets I referenced above (on staging), everything seems to work fine for me once I filter for location.