data-to-insight / csc-validator-be-903

Python-side code for the Quality LAC Data Beta frontend
https://903.datatoinsight.org
MIT License
8 stars 4 forks source link

Knowing the child's period of care dates #341

Open PhilDeakin opened 3 years ago

PhilDeakin commented 3 years ago

Quite a few of the validation rules are dependent on knowing the dates when the children started and ended care (which can happen multiple times), whilst it's possible, albeit with difficulty, to calculate this from the data in episode files, we'd need all of them loaded if they've been in care a long time. As far as I can see we've only got episode and episode_last and we'll need more than that or another way of entering the period of care dates for each child.

For example see Error code 1001. We'd need to know how long they've been in care between the ages of 14 and 16.

dezog commented 3 years ago

This is interesting - we may have to either add the ability to upload numerous years of data, which could complicate the user interface and architecture of the tool itself, or decide that rules like this are beyond the scope of the tool for now, and clearly document that fact.

Do you know how this works when uploading to the actual Gov.UK site? As in, do analysts have to upload multiple years' worth of data, or does the site check against previous years' uploads that they already hold?

PhilDeakin commented 3 years ago

It holds all the files back to 2004, we just upload this year's files to it.

dezog commented 3 years ago

Interesting. Would I be right in guessing analysts don't necessarily have easy access to data going back that far?

It would be a fair amount of work to build the capability to incorporate that much historical data, especially if the previous returns are formatted according to the previous years' specifications. This would then tie into the question raised by @SLornieCYC in https://github.com/SocialFinanceDigitalLabs/quality-lac-data-beta/issues/56 of how to deal with different years' specifications.

If it's straightforward to pull out old data in the new format, it'd make things much simpler, but still require building out the ability to take the extra data into the backend, and creating the user interface to let you load it in via the frontend.

(Sorry but I don't have a very clear picture of the actual systems you guys are dealing with on your end, so this is all a bit hypothetical - would definitely appreciate your perspective)

I'm leaning towards leaving this as an 'if we have time', particularly if many people won't have the data ready, and for now:

  1. Make a note of all the rules affected by this issue, and
  2. Make sure the fact they're not checked is visible to the user so they're not totally blindsided when the Gov.UK portal opens.

Thanks for flagging this up. Keen to get your thoughts on the above

PhilDeakin commented 3 years ago

Yeah, I don't think it's really viable to have that much data on the system. I agree we should just make a note of the ones that require period of care date to calculate and leave them out.

SLornieCYC commented 3 years ago

I agree with the above. Adding multiple years of historic data is quite a leap from the current-and-previous we're looking at right now even if it would only be the Episodes file we need for earlier years. I don't imagine it would be easy (for example) to make the upload process combine multiple Episode-format files from the "previous year" box into a single dataset given the likelihood of duplicate-but-not-exactly-matching rows between years where episodes were open at 31 March.

Although in terms of excluding validation rules as "out of scope" I think there are possibly two groups of rules to consider here:

For the first group presumably we need a section in the tool config to list excluded rules which can then be returned on the front-end alongside the multi-select for active validations. Should this be just a list of rule numbers or also include a reason for exclusion? (either free-text or from pre-set categories)

The second group could either be treated as entirely "out of scope" (easiest) or require some bespoke logic to handle cases where insufficient information is available (i.e. be able to return "valid", "not valid", or "not tested" outcomes). For example with #106 we can only test the POC start date < DATE_PLACED when the relevant RNE = S episode is in either the current or previous year data (or even only in the current year if the previous year isn't provided!).

dezog commented 3 years ago

This is great, thanks both. Excellent work making this list of rules @SLornieCYC.

I think for the 'not possibles', a generic message explaining that these checks require more historic data than the tool can do should be sufficient; just has to signpost that people will have to try and pick those ones up for themselves.

I agree about the 'sometimes possibles': we should ideally be displaying whether we can actually perform those checks or not for a given child (are there cases where we wouldn't be able to tell if we have all the data?). This could potentially be tied into how we display rules that can/can't be checked due to tables not being uploaded, as in https://github.com/SocialFinanceDigitalLabs/quality-lac-data-beta/issues/61

I think for the time being though, we should make a note of all such rules we come across and which category they fall into and just treat them as 'out of scope' for now.

I'll add labels so we can tag these issues as we see them -- if you can think of better names than out of scope & sometimes possible, let me know!

I'm on leave next week, but we'll be working on improvements to the frontend when I get back, including a way to show which rules haven't been checked. That should put us in a better place to judge what to do about these in the time we have left.

dezog commented 3 years ago

Putting a list of excluded rules in config.py sounds exactly right.

@SLornieCYC @PhilDeakin, what do you guys think of this:

For 'out of scope' rules, adding it to the list should close the issue for the rule.

The 'sometimes possibles' should be added to a separate list, and we should note that in a comment on the issue, which should remain open until either we: