COG-UK / dipi-group

Data integrity and pipeline integration working group
4 stars 1 forks source link

Update selected biosamples fields without stomping all fields #10

Closed SamStudio8 closed 3 years ago

SamStudio8 commented 3 years ago

With PHA now looking to backfill data, the pressure is finally high enough to pick up the support for PATCH requests to the API to update sample fields. See https://github.com/SamStudio8/majora/issues/32.

SamStudio8 commented 3 years ago

Bumped by #8 (outbound upload continuity takes priority) and #13 (local work takes priority).

SamStudio8 commented 3 years ago

Bumped by #17 (PHA work takes priority) Update: Bumped by #20 (critical performance work takes priority)

SamStudio8 commented 3 years ago

Bumped by #26 (PHA work takes priority)

SamStudio8 commented 3 years ago

In lieu of new definitions to help toward #26, I'm bumping #10 back to the top of the queue as it has finally become a threat. I'll need to revisit the code itself (the lines in question will be almost a year old now), but I am imagining:

SamStudio8 commented 3 years ago

Mulling over a coffee I don't see how we can avoid trouble partially validating a form without backfilling the form data with initial. Options are limited here:

I'll look for some best practice guidance this afternoon. First port of call will be to write tests that improve the coverage of these forms to ensure any later changes don't disrupt.

SamStudio8 commented 3 years ago

Testing and coverage now public at Travis and codecov. Test suite improvements for biosamples: https://github.com/SamStudio8/majora/commit/f6f90c5e360f9fa6b9086f65a779bca227ed899e, https://github.com/SamStudio8/majora/commit/3846a14c6b8bcbb57462c7fc64ada418519219d7.

SamStudio8 commented 3 years ago

This is an annoyingly philosophical problem. I made the conscious decision last year when I wrote these to lose some of the benefits of the ModelForm in favour of writing a form that actually represents the API interface. I think that this is more clear and easier to maintain, but I must now balance whether the freebies for using the ModelForm outweigh keeping a good representation of the API.

For example, the form to add biosamples covers fields from the BiosampleArtifact, BiosourceSamplingProcess and BiosampleSource models; if we instead used the ModelForm we'd have to maintain the fields and their validators across the three distinct forms and control them from the form_handler which I anticipate could be difficult giving how tangled these models are.

An alternative might be to leverage the existing as_struct on each model and use those fields to populate the form. Handily, the struct version will match the form fields (as there are a couple of form/model discrepancies such as adm1 and collection_location_adm1). The pro of this, the work is very much already in place. The con is that I was trying to deprecate it because it is too inefficient for the the v3 API, and it mostly serves to clutter up the model specifications. If we do this, note that some fields are still suppressed through as_struct (restricted) so we would need to add a mode to ensure they can be read for populating form fields.

It is perhaps worth investigating an initial implementation of the ModelForm route to see what Django magic we get for free. The pro will be removing a loft of duplicate cruft from forms (although I do like the explicit representation of the forms here), and obviously automatic population of form fields when updating an object. If it works out well it will likely benefit the code base, but the con is that this is more substantial work; and I wonder how much work this really saves us -- how much hacking will we need to add in form_handlers to orchestrate the three tightly coupled forms and objects? What about repopulating the biosample_source_id when it needs to be fetched from the process tree? What happens if the second or third ModelForm fails validation but we've already taken action on the first two?

SamStudio8 commented 3 years ago

https://github.com/SamStudio8/majora/commit/7916d90ea7d9e17472cf3b5afc3c78f878dc2439 introduces additional test coverage for the registration of biosamples (but not the COGUK supplemental), as well as a test designed to demonstrate the perils of stomping. These tests should now provide a proper platform on which to test the effects of any proposed fixes to the infamous stomping problem.

SamStudio8 commented 3 years ago

I'm still not entirely convinced with the ModelForm approach. The magic that Django uses to fill in the form from the model instance is hilariously similar to the as_struct method that I'd made for the v2 API which isn't exactly rocket science. However, it's hard to make an informed decision without trying to integrate an example. The easiest model for which to do this is one that I'd neglected to list earlier, the COGUK_BiosourceSamplingProcessSupplement. This model isn't complex; its field names match the form names, and the validation doesn't have any special additional rules either.

https://github.com/SamStudio8/majora/commit/1bb5aacf711e297f7682bf88db6bc388022e48b4 removes the form_handling for the supplemental model and turns it over to the magic ModelForm instead. These fields have been added to the biosample tests to show that the change does not damage how supplemental information is created or updated (https://github.com/SamStudio8/majora/commit/5e2dc2991c0d719548427bb790efa2654929cc88). It was somewhat painless.

Frustratingly, I've realised the issue at hand won't be solved with ModelForm magic after all; sending null values will still cause the ModelForm to dutifully overwrite these fields. Options are:

(3) isn't possible because we need the forms to be able to validate with partial data to support updates in the first place. (2) is basically the as_struct proposal from earlier. (1) is effectively (2) with the potential bonus of removing some cruft from forms and form_handlers.

SamStudio8 commented 3 years ago

Sorry I'm talking rubbish, I'm conflating two issues here; a) partial updates are not possible because the validators will fail without a complete model, b) uploading a bunch of null will blow up the model. B is actually intended behaviour and we still want to allow overwriting of data with null, this will also ensure the metadata uploader continues to work in a stompy fashion for backwards compat. We don't want to drop null from the input data, but we do want to use the ModelForm to fill in missing data such that the validators can do their job and accomplish A.

SamStudio8 commented 3 years ago

After some experiments last night I've discovered Form and ModelForm are automatically filling missing POST fields with None, which makes it very difficult to support partial forms. I think this is why I was confusing myself yesterday. This is because the form's _clean_fields calls value_from_datadict on the form widget which uses the dict get, returning None if the field is not in the request.

Further investigation for a workaround needed, presumably we'll have to hack the fields or exclude list on the ModelForm. This is exactly the sort of thing that DRF was for, I guess.

I'm not entirely convinced given the ModelForm doesn't even work as magically as hoped that it will be worth the effort. What happens if a future validator depends on fields from more than one model -- this would be almost impossible to handle with ModelForm.

SamStudio8 commented 3 years ago

If you want something to work the way you want, you're just gonna have to do it yourself. I've subclassed the Django forms.ModelForm, and overridden the __init__ to accept two keyword arguments: partial and partial_request_keys, if present, these will post-filter the fields property (not base_fields) to forcibly boot the missing fields off the form (https://github.com/SamStudio8/majora/commit/2e01512e862d5d8108c9ced9c4dce539add93c16). This means when it comes to the cleaning step, the form will only call validation (and fill in the cleaned data dict) for the fields that have been sent, neatly this also limits updates to the instance to those fields too.

+  468 class MajoraPossiblePartialModelForm(forms.ModelForm):                                                                                                                                                                                        +  469     def __init__(self, *args, **kwargs):                                                                                                                                                                                                      
+  470         partial = kwargs.pop('partial', False)                                                                                                                                                                                                
+  471         partial_request_keys = kwargs.pop('partial_request_keys', [])                                                                                                                                                                         
+  472         super().__init__(*args, **kwargs)                                                                                                                                                                                                     
+  473                                                                                                                                                                                                                                               
+  474         # Drop any fields that are not specified in the request                                                                                                                                                                               
+  475         if partial and partial_request_keys:                                                                                                                                                                                                  
+  476             allowed = set(partial_request_keys)                                                                                                                                                                                               
+  477             existing = set(self.fields)                                                                                                                                                                                                       
+  478             for field_name in existing - allowed:                                                                                                                                                                                             
+  479                 self.fields.pop(field_name)

I've integrated this example (https://github.com/SamStudio8/majora/commit/1bb5aacf711e297f7682bf88db6bc388022e48b4). A new test test_biosample_add_update_partial_supp_stomp (https://github.com/SamStudio8/majora/commit/aa4513a9bd9cb799ad38d536e505870018a50866#diff-16d4692c80c1f313ffd79c9896781b0f71de4c3c96bdfacf16ce06ba44f4c237R413) attempts to set the COG supplemental fields one by one (ie. with missing fields that would previously set everything back to None). The test checks that the instance has both changed from the last update, and matches the expected new state. This test now passes -- you're looking at a working implementation of a solution to the stomping problem :100:

The supplemental data model is small fry compared to the complexity of the other models involved in the biosample add endpoint, and I'm still not entirely convinced the ModelForm is the way to go, but we've got a lead at least.

SamStudio8 commented 3 years ago

Note that this implementation will break ungracefully when a user tries to create a new object with partial=True and fails to provide any field that is required on the underlying model.

SamStudio8 commented 3 years ago

My next candidate is the BiosourceSamplingProcess, partly because it seems to be easier to refactor this function by starting at the end and working back, but also because it'll prove whether the ModelForm strategy is garbage. We present "nicer" field names to users (eg. collection_location_adm1 is presented as adm1) which will need to be mapped somewhere. I'm thinking the most straightforward way to do this is use the modify_preform to just rename the fields. We can presumably jam something into the Meta to hold the canonical mapping.

I've gone ahead and assumed that storing the field mapping in the form's Meta class is a great idea https://github.com/SamStudio8/majora/commit/301da1f8545422d0ba58eef68bf8c09fe9d8db21

SamStudio8 commented 3 years ago

With much more frustration than planned for, I've wrangled the BiosourceSamplingProcess to a stomp-safe implementation. I've added a map_request_fields function to the earlier superclass that maps request data to the real model field names according to the Meta.field_map of the form instance. We intercept data and initial before they head up to BaseForm via super to fix field names, and also intercept the _errors in _post_clean to spit errors with the user-facing field names back to clients. It's pretty cool, actually.

Honestly this was more work than intended, and I worry we've injected a whole load of Majora-specific magic that is less clear now the ModelForm is taking care of things. Aside from that, the test case shows that the stomp-free implementation works. However, it's also discovered an edge case where adding an error to the form inside clean will raise a ValueError if the field has been suppressed from the form (by Django's design). This only happens in cases where we add an error to a field on the form based on another field (e.g. adding a warning to received_date if collection_date is also missing). The problem here is that the form clean is processed regardless of whether the fields are present or not. I'm sure we'll find out how much of an issue that is, eventually.

SamStudio8 commented 3 years ago

The groundwork laid for the sampling process meant that the remaining pieces were actually quite straightforward, the BiosampleSource and BiosampleArtifact are now also stomp free. https://github.com/SamStudio8/majora/commit/f868d06db9d4f038243531ee5bae35beb0ecabe6 completely removes the form_handlers.handle_testsample function, all the heavy lifting is done exclusively by api_views.add_biosample. The interface now takes a partial attribute that enables my partial updates magic on the superclassed ModelForm (disabled by default).

The test_biosample_add_update_partial_supp_stomp test case covers all user-reachable fields and appears to show that creating and updating (partial or otherwise) work as expected. There is some remaining weirdness with None vs. '' that I don't entirely understand, I suspect I need to formally remove null=True from the underlying models as that is not good practice.

Shockingly, after avoiding this for so long, we've made excellent progress in just two days. To be fair, mental preparation is about 90% of refactoring. I think it would be wise to try and write some test cases that match the behaviours we're expecting to cover when we deploy this feature for real.

This is part of the battle - now;

SamStudio8 commented 3 years ago

Crushing this. Draft PR now over on Majora but there is still a lot to be done. Improved test suite (https://github.com/SamStudio8/majora/commit/121fbc4929c0f3062ad90cb528c24383163a6175) has picked up a few small issues; I've pinned down the form error that was being thrown and caught it which means the collection_date does not need to be provided with partial updates.

SamStudio8 commented 3 years ago

To ease debugging (and because it's just generally useful), I've finally linked the TatlVerb reverse generic relations to the MajoraArtifact model (https://github.com/SamStudio8/majora/commit/abaecc68f904fdda5dfb0ff26dca1b681da0a9af) so we can print out some useful information of an artifact's update history in Majora:

Screenshot from 2021-02-22 15-52-49

We just dump the whole extra_context JSON for now, but we could add some util functions to limit or munge that data in future. Nothing in there is sensitive.

SamStudio8 commented 3 years ago

Given that we want to deprecate the current update_biosample interface reserved for PHA, I'm thinking I'll use that endpoint to check the artifact exists, automatically set the new partial flag, then internally forward the request to add_biosample.

Update: That's actually a little complicated as each endpoint has an f function that is only in the scope of the endpoint. That is, it's quite hard to just say "use the add biosample behaviour" inside the update endpoint. Additionally, the endpoint itself does nothing other than pass its own f to wrap_api_v2, so merely forwarding the request would also mean the OAuth scopes will be incorrect.

Looks like it is time to learn me some class based views!

SamStudio8 commented 3 years ago

https://github.com/SamStudio8/majora/commit/8b037c611f79e303f547a60e5f3e886dafeb8cee removes api_views.add_biosample and api_views.update_biosample and adds a new BiosampleArtifactEndpointView class-based view controller. Going forward I'd like to see all the endpoints migrated to this format as it is superior Django style. There are mixins for updating and creating bundled with Django but they are written with the mindset that GET requests should show an HTML form (although there are JSON mixins too).

At the moment this class view does nothing other than giving us a fancy way to dispatch wrap_api_v2 with the right partial flag and OAuth scopes, but I already have some neat ideas about checking for required fields and taking some gubbins from wrap_api_v2. The current implementation is still a little messy because there is a lot of code in wrap_api_v2 that I don't want to touch right now (Tatl request monitoring and OAuth permission checks), so we use the wrapper as before.

We seem to be heading in the direction of implementing something that is looking a lot like DRF. I wasn't really happy with the DRF v3 API as I don't think all the magic was worth it, but if that is where we indeed end up, then so be it.

Note that this means the partial flag on the client-side is already deprecated. We set it automatically inside the new CBV depending on whether the tail end of the API endpoint name is "add" or "update".

SamStudio8 commented 3 years ago

Now to Ocarina:

That last change will have the most damage if left unchecked. Possible issues that arise from this change:

SamStudio8 commented 3 years ago

I think we're ready to proceed with integrating this patch. Plan is:

SamStudio8 commented 3 years ago

I am satisfied with the uploader test conducted with AU earlier today. Tomorrow morning I will integrate the d10-stomping branch to stable and deploy to production. We can conduct a live weapons test and warn the metadata channel shortly afterwards.

SamStudio8 commented 3 years ago

I've bumped the minimum ocarina version to 0.37.0 so figured it would be best to give a day of notice of the change. I've warned #metadata that this patch will be applied tomorrow morning. I'll get in touch with FS to arrange a time to conduct tests.

SamStudio8 commented 3 years ago

Deployed d10-stomping to prod had the nasty side effect of destroying performance loading the detail_artifact view. My gut feeling was this must be the new audit log I added at the bottom. This worked fine on test which has a much smaller database. I've patched in a db_index on tatl.TatlVerb.content_id which seems to have done the trick (https://github.com/SamStudio8/majora/commit/73b76fa3285550cf895977e6a6c3bf3e750031fb).

I've notified #metadata-apis of the change, and asked AU to loop me in on any user reports arising today. FS has had trouble with their script this morning so this might raise an urgent opportunity to test the new interface for real.

SamStudio8 commented 3 years ago

Noticed some HTTP500 coming through shortly after the patch. In my focus on the biosample endpoints, I didn't try adding a library artifact during testing. The addition of the partial kwarg to f needed to be applied to all f across all endpoints, which I hadn't done - raising an unexpected keyword argument error. Awks. I've patched this (https://github.com/SamStudio8/majora/commit/9bc817be6e1c173e0ff8dc216c030a081fb04b89). The affected user has been contacted and the affected data has been re-uploaded.

SamStudio8 commented 3 years ago

Successful test with FS, this will be rolled out to the PHE backfilling scripts.

SamStudio8 commented 3 years ago

Downgrading severity as I've implemented the urgent components and people seem happy! :tada:

SamStudio8 commented 3 years ago

Sadly given the complexities involved handling all the diverse ways data is uploaded through us, an edge case has slipped through the net. A single WSI "empty" sample was partially updated by a PHA through the shiny new interface earlier today.

The new partial endpoint enables validation to be skipped on fields not included in the submission, which is normally fine, except in the special case of a sample that didn't have all the mandatory fields to begin with. The sample had enough metadata to be processed by Elan (which only checks for a magic flag), despite not actually having the mandatory information in place. This has caused early termination of Asklepian as a sample with neither a received or collection date has gummed up the processing of best references.

To mitigate:

Data fix:

I've pruned the sequence from tonight's Asklepian best_ref list and MSA, so it should be able to continue where it left off.

Although the test cases covered a wide variety of sample states, I had forgotten to include an example that specifically called addempty. I was so focused on ensuring the regular uploader behaviour didn't change, I'd forgotten about the subtleties of addempty because it's such a curious interface (in particular, the way created.submitter_user is left blank as a flag). Going further we may want a MajoraArtifact.incomplete flag to raise when empty samples are pushed, and unset it when the model is successfully saved for the first time (after force creation) but I'm too tired to think about that right now.

SamStudio8 commented 3 years ago

There appear to be a few additional samples with a similar fate, but for some reason they were filtered out from the metadata TSV earlier in the process. These also have PAGs in Majora which will need removing, but do not appear to be troubling Asklepian. They will however need to be fixed before midnight Monday, as they will probably trash the outbound pipelines.

SamStudio8 commented 3 years ago

This has been interesting to unpick. My working theory is that these additional samples that are missing from the metadata TSV were uploaded during a short window today, between the Elan resolve_uploads step failing, and me manually triggering a refresh of the latest.tsv and calling resolve_uploads again.

save_manifest is the first task Elan runs and is responsible for copying the latest.tsv manifest ("the metadata table") somewhere safe. I assume what has happened here, is that that step ran successfully and its result was cached. This would explain why samples processed by Elan are missing from the downstream metadata table -- the cached result would not have been the same metadata table as what was read by resolve_uploads.

This may seem surprising as you might have anticipated Nextflow to determine the input file was updated and re-run the step. However, the file is provided as a parameter instead. I've confirmed this theory with nextflow log, to find the workdir and indeed I can date the majora.metadata.tsv manifest to 1601: the same minute Elan started today.

These other samples are therefore in the dataset and will also need removing tomorrow, however as they were missing from the cached manifest, they've not gone on to cause the same kind of trouble as the single sample that made it through (as the manifest controls reconcile_downstream, which builds the FASTA and matched metadata table).

Mitigation:

Data fix:

bad_eggs = models.BiosampleArtifact.objects.filter(created__biosourcesamplingprocess__submission_user__username="███████████", created__biosourcesamplingprocess__collection_date__isnull=True, created__biosourcesamplingprocess__received_date__isnull=True)
>>> for egg in bad_eggs:
...   try:
...     pag = models.PublishedArtifactGroup.objects.get(tagged_artifacts__in=[egg])
...   except models.PublishedArtifactGroup.DoesNotExist:
...     continue
...   for dra in pag.tagged_artifacts.filter(digitalresourceartifact__id__isnull=False):
...     dra.delete()
...   pag.quality_groups.all().delete()
...   pag.delete()

Additionally, a handful of good PAGs were caught up in the same issue today. They should be correctly processed tomorrow when reconcile_downstream rebuilds the merged consensus.

SamStudio8 commented 3 years ago

Save for yesterday's explosion (which I have now written up as an experiment of new ways to report things) this has gone very well. I'm opening a follow-up issue on Majora's repo to finish moving the non-biosample interfaces to partial-friendly methods as they aren't a DIPI priority now.