data-to-insight / csc-validator-be-903

Python-side code for the Quality LAC Data Beta frontend
https://903.datatoinsight.org
MIT License
8 stars 4 forks source link

Bug: Rule 437 - Flags additional episodes when there aren't any #625

Open tab1tha opened 1 year ago

tab1tha commented 1 year ago

Describe the bug Rule is flagged when reason for closure is E15. However, that episode should not be flagged unless there is another episode after an episode with that code.

To Reproduce

CHILD | DECOM | DEC | REC | REASON_PLACE_CHANGE -- | -- | -- | -- | -- aaaaa | 17/07/2022 | 26/08/2022 | E15 |   aaaaa | 16/08/2021 | 17/07/2022 | X1 | CARPL

Expected behavior That E15 position is the latest episode. It shouldn't have been flagged.

Screenshots If applicable, add screenshots to help explain your problem. NOTE: Ensure you remove any identifying information

Proposed fix (Optional) Could it be a sorting problem that is causing the next line to be interpreted as a subsequent episode even though it happens on an earlier date?

tab1tha commented 1 year ago

It has been confirmed that when the data is manually sorted, this error is no longer encountered.

WillLP-code commented 1 year ago

Having input the data from the bug report into a test dataframe and writing the 437 function as a separate script for testing, I've come up with this, which works:

'''If <REC> = ‘E2’ or ‘E15’ then no other 
episode <DECOM> should be later than 
current episode <DEC>'''

import pandas as pd

fake_data_episodes = pd.DataFrame({
    'CHILD': ['101', '102', '102', '103', '103', '103', '104', '104', '104', '104', '105', '105', 'sanjay', 'sanjay', 'not_sanjay', 'not_sanjay'],
    'DECOM': ['20/10/2021', '19/11/2021', '17/06/2021', '04/04/2020', '11/09/2020', '07/05/2021', '15/02/2020',
                '09/04/2020', '24/09/2020', '30/06/2021', pd.NA, '20/12/2020',  '16/08/2021', '17/07/2022','07/05/2021', '15/02/2020',],
    'REC': ['E2', 'E15', 'X1', 'X1', 'E15', pd.NA, 'X1', 'E4a', 'X1', pd.NA, 'E2', 'X1', 'X1', 'E15', pd.NA, pd.NA],
})

episodes = fake_data_episodes
episodes.reset_index(inplace=True)

episodes['DECOM'] = pd.to_datetime(episodes['DECOM'], format='%d/%m/%Y', errors='coerce')

episodes_2 = episodes.copy()

episodes = episodes[episodes['REC'].isin(['E2', 'E15'])]

episodes_merged = episodes.merge(episodes_2, on=['CHILD'], suffixes = ('_1', '_2'))
episodes_merged = episodes_merged[episodes_merged['DECOM_1'] != episodes_merged['DECOM_2']]

condition = (episodes_merged['DECOM_1'] < episodes_merged['DECOM_2']) & (~episodes_merged['REC_2'].isin(['E15', 'E2']))
#error_mask = episodes_merged[condition]

error_locations = episodes_merged.loc[condition, 'index_2']

print(error_locations)

The original rule code did some quite funky things by sorting the episodes table by ID and DECOM, comparing each row to the next to find out if it was a a failing row, which could have caused the bug that's produced that disappears when sorting manually:

episodes.sort_values(['CHILD', 'DECOM'], inplace=True)
 episodes[['NEXT_DECOM', 'NEXT_CHILD']] = episodes[['DECOM', 'CHILD']].shift(-1)

Interestingly, putting the current 437 function in a separate script and testing it against the bug report data, I could reproduce the bug when sorting manually in any way unless I also forgot to sort the REC field too.

The new version I've written merges the dataframe on itself which allows for comparisons between DECOM dates without having to do the sort.

No one has to feel compelled to respond to this, It's just a useful way for me to store my notes and share them if anyone is also looking at this bug.

SLornieCYC commented 1 year ago

condition = (episodes_merged['DECOM_1'] < episodes_merged['DECOM_2']) & (~episodes_merged['REC_2'].isin(['E15', 'E2']))

I don't think that last part (in bold) is needed. Once a child has had a <REC> of E15 or E2 (handled earlier in your script) there should be no further episodes at all, whether they also end in the same <REC> codes or not!