HumanCellAtlas / dcp

Data Coordination Platform manifest and integration tests.

2019Q4 work towards reducing dataset processing failures from 1 in 3 to 1 in 20 and reporting 100% of inter-component handover errors. #519

Open justincc opened 4 years ago

justincc commented 4 years ago

The work in this quarter consists of:

  a) per-component reliability work removed from this board as per discussion in PM 2019-10-31 call
  b) DCP-wide information gathering on reliability performance, recording and publication. It is critical this is tracked and completed at the DCP level.

Following DCP-wide feedback on the existing reliability information, subsequent quarters will need work to gather more information and then specific tasks to increase reliability. Outside of implementation work, we will need product to decide appropriate reliability targets in conjunction with tech-arch, and then a process to identify what needs to be done to improve reliability and schedule the items for action.

See the INSANE DCP reliability working group doc.

Follow on from #418.

ambrosejcarr commented 4 years ago

@morrisonnorman @justincc I understand from @brianraymor's PM meeting notes that this can be scoped and delivered by Q4M3. However, I'm having difficulty parsing the work towards this roadmap objective, and I need help understanding a few things so that I can resolve the prioritization questions that Brian elevated to PL:

  1. Which tickets relate to reporting errors vs reducing rates from 1/3 to 1/20?
  2. Can you quantify how much progress has been made towards either goal in Q3? I see that the majority of work from https://app.zenhub.com/workspaces/dcp-5ac7bcf9465cb172b77760d9/issues/humancellatlas/dcp/418 is completed and the remaining ticket will be closed in Q4M1.
  3. Am I understanding correctly that this is blocked because it is a continuation of https://app.zenhub.com/workspaces/dcp-5ac7bcf9465cb172b77760d9/issues/humancellatlas/dcp/418?
  4. In https://docs.google.com/document/d/1NeYEWkOj25xdV4hM96q-v-0DLOZynmLnWPqlMI3Tjtk/edit# I see a phasing of the work that's not reflected here. Could you help me understand what you expect to complete in Q4, and what you propose to push to Q1 2020?

Thanks. Happy to get on a call early next week if it would be helpful.

ambrosejcarr commented 4 years ago

Norman and I synced up and decided on some next steps. The short-term goals are to decide how to measure reliability, identify what the current reliability is, and estimate the improvement for Q3.

@morrisonnorman to sync with @parthshahva on:

@justincc to either confirm tickets are ordered, or do some rough ordering of tickets in terms of effort/value ratio.

When this is done, @morrisonnorman and @justincc should be able to associate some set of tickets from this epic to a targeted reliability improvement for Q4.

justincc commented 4 years ago

@ambrose, the DCP-wide work necessary to cover publishing DCP-wide reliability statistics, which will give the DCP awareness of the true situation in the same way that we have a project tracking dashboard, was being covered in #521. However, @brianraymor closed it because it needed work. I don't have time to try and perfect ticket descriptions, so this comment will have to serve as notice that this work is critical and will be done under this epic.

Regarding the per-component tickets, for reasons known elsewhere I have not had time to consider them. In truth, this is an area where tickets emerge as and when we discover problems. As no other components have chosen to expose their work in this way, and because that part is confined to individual components, I propose to remove it entirely from the DCP view.