data-to-insight / csc-validator-be-903

Python-side code for the Quality LAC Data Beta frontend
https://903.datatoinsight.org
MIT License
8 stars 4 forks source link

Bug: Rule 602 - Rule inverted #526

Closed SLornieCYC closed 2 years ago

SLornieCYC commented 2 years ago

Describe the bug Rule is currently flagging all children adopted in year (AD1) with an in-year episode <REC> = E11 or E12.

Expected behavior It should be flagging children in AD1 who don't have an in-year episode <REC> = E11 or E12.

Proposed fix Either:

dezog commented 2 years ago

By the looks of it, this shouldn't be happening (See below). We drop non-E11/E12 rows from Episodes before merging with AD1, but then keep only the 'left_only' rows

I've looking through some fake data and it appears to be working as intended. It might be worth checking for:

            mask1 = (epi['DEC'] <= collection_end) & (epi['DEC'] >= collection_start)
            mask2 = epi['REC'].isin(['E11', 'E12'])
            adoption_eps = epi[mask1 & mask2]

            adoption_fields = ['DATE_INT', 'DATE_MATCH', 'FOSTER_CARE', 'NB_ADOPTR', 'SEX_ADOPTR', 'LS_ADOPTR']

            err_list = (ad1
                        .merge(adoption_eps, how='left', on='CHILD', indicator=True)
                        .query("_merge == 'left_only'")
                        .dropna(subset=adoption_fields, how='all')
                        .index
                        .to_list())

            return {'AD1': err_list}
SLornieCYC commented 2 years ago

Tried again, making sure I changed the collection year to 20/21. There shouldn't be any issues with case sensitivity as I'm using last years' finalised return csvs as exported from the DFE portal.

image

image

image

And here are the rows as they appear in Notepad+ (deleted other rows from Episodes for screenshot only):

image

image

tab1tha commented 2 years ago

@SLornieCYC can we say the only cases where adoption values are provided and an error is not raised should be when REC of that child is found to be E11 or E12 OR when the child is observed to have an episode within the collection period?

If this is correct, then the error can be resolved by changing & to | in the line where mask 1 is combined with mask 2. That is,

            mask1 = (epi['DEC'] <= collection_end) & (epi['DEC'] >= collection_start)
            mask2 = epi['REC'].isin(['E11', 'E12'])
            adoption_eps = epi[mask1 | mask2]

            adoption_fields = ['DATE_INT', 'DATE_MATCH', 'FOSTER_CARE', 'NB_ADOPTR', 'SEX_ADOPTR', 'LS_ADOPTR']

            err_list = (ad1
                        .merge(adoption_eps, how='left', on='CHILD', indicator=True)
                        .query("_merge == 'left_only'")
                        .dropna(subset=adoption_fields, how='all')
                        .index
                        .to_list())

            return {'AD1': err_list}

I am drawing the interpretation from the pseudocode of the error as stated here: https://github.com/SocialFinanceDigitalLabs/quality-lac-data-beta-validator/issues/262

SLornieCYC commented 2 years ago

I think the only cases where an error should not be raised are where child has <REC> = E11 or E12 OR where child does not have an episode in the collection period.

image

tab1tha commented 2 years ago

That makes a lot of sense. Thank you.

tab1tha commented 2 years ago

This is not as intuitive as I earlier imagined, I am considering the possibility that the bug is coming from the display side. @SLornieCYC please can you confirm that the Child ID is the same in the first and second screenshot which you shared?.

Also, is the error raised only when one of the conditions is true and the other is false but not raised when both conditions are true?.

SLornieCYC commented 2 years ago

Tested again on the live site:

Collection year: 2020/2021 Files loaded (from last year's submitted return): Adoption, Episodes

Result is as shown in the screenshots provided above.

As @dezog wrote above, it does look like the mask is the right way around already because mask1 and mask2 are working additively to return only in-year E11/E12 episodes (which are then joined to AD1 and discarded as "correct" to find the remaining errors). I'm really struggling to see why - or where - my data is failing here, because to me it looks right.

dezog commented 2 years ago

Thanks for that @SLornieCYC, that's pretty definitive: this rule is playing up !

dezog commented 2 years ago

@tab1tha:

@SLornieCYC can we say the only cases where adoption values are provided and an error is not raised should be when REC of that child is found to be E11 or E12 OR when the child is observed to have an episode within the collection period?

If this is correct, then the error can be resolved by changing & to | in the line where mask 1 is combined with mask 2. (...) I am drawing the interpretation from the pseudocode of the error as stated here: #262

That pseudocode is so confusing; it does seem to say that, but I think it's mistaken as the text of the rule seems clear that it's meant to be making sure that any child with AD1 info filled in has an episode with E11/E12 REC AND a DEC during the year:

The episode data submitted for this child does not show that he/she was adopted during the year (...) If the child has been adopted during the year code E11 or E12 should be recoded as being the reason the episode ceased

dezog commented 2 years ago

I think I've cracked it - we're using the index resulting from the merge - we need to save the original index by doing .reset_index() first, then .set_index('index') before the .index

tab1tha commented 2 years ago

Nice spot Desi. I have implemented that change. Thank you @SLornieCYC for all the help !. Merry Christmas