SamStudio8 closed this issue 3 years ago.
Bumped by #8 (outbound upload continuity takes priority) and #13 (local work takes priority).
Bumped by #17 (PHA work takes priority).
Update: Bumped by #20 (critical performance work takes priority).
Bumped by #26 (PHA work takes priority)
In lieu of new definitions to help toward #26, I'm bumping #10 back to the top of the queue as it has finally become a threat. I'll need to revisit the code itself (the lines in question will be almost a year old now), but I am imagining:

- `api_views`: a flag to determine create or update mode (let's assume the uploader only ever stomps, to maintain backward compat for now)
- `form_handler`: will use the update flag and override the `.get` `None` default with the value of the current object

Mulling it over a coffee, I don't see how we can avoid trouble partially validating a form without backfilling the form data with `initial`. Options are limited here; e.g. switching to `ModelForm`, where biosamples and collections would be validated separately with `ModelForm`s as part of the same query.

I'll look for some best-practice guidance this afternoon. First port of call will be to write tests that improve the coverage of these forms, to ensure any later changes don't disrupt them.
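As a rough illustration of what backfilling the form data from the current object could look like (the names here are illustrative, not Majora's actual code), the idea is to overlay the partial request onto the current object's values so a full validation pass has something for every field:

```python
def backfill_partial(request_data, current_values):
    """Overlay a partial request onto the current object's values so that
    full form validation can run. Keys present in the request always win
    (so the uploader can still stomp for backward compat)."""
    merged = dict(current_values)
    merged.update(request_data)
    return merged

# e.g. an update that only sends a new collection date keeps the rest intact:
current = {"adm1": "UK-ENG", "collection_date": "2020-05-01"}
update = {"collection_date": "2020-06-01"}
print(backfill_partial(update, current))
# {'adm1': 'UK-ENG', 'collection_date': '2020-06-01'}
```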
Testing and coverage now public at Travis and codecov. Test suite improvements for biosamples: https://github.com/SamStudio8/majora/commit/f6f90c5e360f9fa6b9086f65a779bca227ed899e, https://github.com/SamStudio8/majora/commit/3846a14c6b8bcbb57462c7fc64ada418519219d7.
This is an annoyingly philosophical problem. I made the conscious decision last year, when I wrote these, to lose some of the benefits of the `ModelForm` in favour of writing a form that actually represents the API interface. I think that this is clearer and easier to maintain, but I must now weigh whether the freebies for using the `ModelForm` outweigh keeping a good representation of the API.
For example, the form to add biosamples covers fields from the `BiosampleArtifact`, `BiosourceSamplingProcess` and `BiosampleSource` models; if we instead used `ModelForm`s, we'd have to maintain the fields and their validators across three distinct forms and control them from the `form_handler`, which I anticipate could be difficult given how tangled these models are.
An alternative might be to leverage the existing `as_struct` on each model and use those fields to populate the form. Handily, the struct version will mostly match the form fields (there are a couple of form/model discrepancies such as `adm1` and `collection_location_adm1`). The pro of this: the work is very much already in place. The con is that I was trying to deprecate it, because it is too inefficient for the v3 API and it mostly serves to clutter up the model specifications. If we do this, note that some fields are still suppressed through `as_struct` (restricted), so we would need to add a mode to ensure they can be read for populating form fields.
It is perhaps worth investigating an initial implementation of the `ModelForm` route to see what Django magic we get for free. The pro will be removing a lot of duplicate cruft from `forms` (although I do like the explicit representation of the forms here), and obviously the automatic population of form fields when updating an object. If it works out well it will likely benefit the code base, but the con is that this is more substantial work; and I wonder how much work it really saves us -- how much hacking will we need to add in `form_handlers` to orchestrate the three tightly coupled forms and objects? What about repopulating the `biosample_source_id` when it needs to be fetched from the process tree? What happens if the second or third `ModelForm` fails validation but we've already taken action on the first two?
https://github.com/SamStudio8/majora/commit/7916d90ea7d9e17472cf3b5afc3c78f878dc2439 introduces additional test coverage for the registration of biosamples (but not the COGUK supplemental), as well as a test designed to demonstrate the perils of stomping. These tests should now provide a proper platform on which to test the effects of any proposed fixes to the infamous stomping problem.
I'm still not entirely convinced by the `ModelForm` approach. The magic that Django uses to fill in the form from the model instance is hilariously similar to the `as_struct` method that I'd made for the v2 API, which isn't exactly rocket science. However, it's hard to make an informed decision without trying to integrate an example. The easiest model for which to do this is one that I'd neglected to list earlier, the `COGUK_BiosourceSamplingProcessSupplement`. This model isn't complex; its field names match the form names, and the validation doesn't have any special additional rules either.
https://github.com/SamStudio8/majora/commit/1bb5aacf711e297f7682bf88db6bc388022e48b4 removes the `form_handling` for the supplemental model and turns it over to the magic `ModelForm` instead. These fields have been added to the biosample tests to show that the change does not damage how supplemental information is created or updated (https://github.com/SamStudio8/majora/commit/5e2dc2991c0d719548427bb790efa2654929cc88). It was somewhat painless.
Frustratingly, I've realised the issue at hand won't be solved with `ModelForm` magic after all; sending `null` values will still cause the `ModelForm` to dutifully overwrite those fields. Options are:

1. drop `null` (and zero length?) keys from the incoming form data
2. backfill `null` (and zero length?) keys in the incoming form data from the instance
3. leave missing fields to fall through as `None`

(3) isn't possible because we need the forms to be able to validate with partial data to support updates in the first place. (2) is basically the `as_struct` proposal from earlier. (1) is effectively (2) with the potential bonus of removing some cruft from `forms` and `form_handlers`.
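A sketch of option (1), dropping `null` (and optionally zero-length) keys before the form ever sees them; the function name and the `drop_empty` switch are purely illustrative:

```python
def drop_null_keys(payload, drop_empty=False):
    """Option (1): strip keys whose value is None (and, optionally,
    zero-length strings) from the incoming form data so they cannot stomp."""
    return {
        k: v for k, v in payload.items()
        if v is not None and not (drop_empty and v == "")
    }

print(drop_null_keys({"adm1": "UK-ENG", "received_date": None, "mode": ""}))
# {'adm1': 'UK-ENG', 'mode': ''}
```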
Sorry, I'm talking rubbish; I'm conflating two issues here: (a) partial updates are not possible because the validators will fail without a complete model; (b) uploading a bunch of `null` will blow up the model. B is actually intended behaviour and we still want to allow overwriting of data with `null`; this will also ensure the metadata uploader continues to work in a stompy fashion for backwards compat. We don't want to drop `null` from the input data, but we do want to use the `ModelForm` to fill in missing data such that the validators can do their job and accomplish A.
After some experiments last night I've discovered that `Form` and `ModelForm` automatically fill missing POST fields with `None`, which makes it very difficult to support partial forms. I think this is why I was confusing myself yesterday. This is because the form's `_clean_fields` calls `value_from_datadict` on the form widget, which uses the dict `get`, returning `None` if the field is not in the request.
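The mechanism is easy to reproduce without Django at all; for a simple widget, `value_from_datadict` is essentially a `dict.get`, so any field absent from the POST data comes back as `None` and is then dutifully cleaned and saved:

```python
# What the widget effectively does for a field missing from the request:
request_data = {"received_date": "2021-03-01"}

# Django (roughly): value = widget.value_from_datadict(data, files, name)
value = request_data.get("collection_date")
print(value)
# None -- the form cleans this as if the client deliberately sent null
```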
Further investigation for a workaround is needed; presumably we'll have to hack the `fields` or `exclude` list on the `ModelForm`. This is exactly the sort of thing DRF was for, I guess.
I'm not entirely convinced, given the `ModelForm` doesn't even work as magically as hoped, that it will be worth the effort. What happens if a future validator depends on fields from more than one model? That would be almost impossible to handle with `ModelForm`.
If you want something to work the way you want, you're just gonna have to do it yourself. I've subclassed the Django `forms.ModelForm` and overridden the `__init__` to accept two keyword arguments: `partial` and `partial_request_keys`. If present, these will post-filter the `fields` property (not `base_fields`) to forcibly boot the missing fields off the form (https://github.com/SamStudio8/majora/commit/2e01512e862d5d8108c9ced9c4dce539add93c16). This means that when it comes to the cleaning step, the form will only call validation (and fill in the cleaned data dict) for the fields that have been sent; neatly, this also limits updates to the `instance` to those fields too.
```python
class MajoraPossiblePartialModelForm(forms.ModelForm):
    def __init__(self, *args, **kwargs):
        partial = kwargs.pop('partial', False)
        partial_request_keys = kwargs.pop('partial_request_keys', [])
        super().__init__(*args, **kwargs)

        # Drop any fields that are not specified in the request
        if partial and partial_request_keys:
            allowed = set(partial_request_keys)
            existing = set(self.fields)
            for field_name in existing - allowed:
                self.fields.pop(field_name)
```
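Stripped of the Django scaffolding, the `__init__` filter above boils down to plain set/dict logic (the field names here are illustrative):

```python
def filter_partial_fields(fields, partial, partial_request_keys):
    """Keep only the form fields named in the request; everything else is
    booted off so it is neither validated nor written to the instance."""
    if not (partial and partial_request_keys):
        return dict(fields)
    allowed = set(partial_request_keys)
    return {name: f for name, f in fields.items() if name in allowed}

fields = {"collection_date": "...", "received_date": "...", "adm1": "..."}
print(sorted(filter_partial_fields(fields, True, ["adm1"])))
# ['adm1']
```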
I've integrated this example (https://github.com/SamStudio8/majora/commit/1bb5aacf711e297f7682bf88db6bc388022e48b4). A new test, `test_biosample_add_update_partial_supp_stomp` (https://github.com/SamStudio8/majora/commit/aa4513a9bd9cb799ad38d536e505870018a50866#diff-16d4692c80c1f313ffd79c9896781b0f71de4c3c96bdfacf16ce06ba44f4c237R413), attempts to set the COG supplemental fields one by one (i.e. with missing fields that would previously set everything back to `None`). The test checks that the instance has both changed from the last update and matches the expected new state. This test now passes -- you're looking at a working implementation of a solution to the stomping problem :100:
The supplemental data model is small fry compared to the complexity of the other models involved in the biosample add endpoint, and I'm still not entirely convinced the `ModelForm` is the way to go, but we've got a lead at least.

Note that this implementation will break ungracefully when a user tries to create a new object with `partial=True` and fails to provide any field that is `required` on the underlying model.
My next candidate is the `BiosourceSamplingProcess`, partly because it seems easier to refactor this function by starting at the end and working back, but also because it'll prove whether the `ModelForm` strategy is garbage. We present "nicer" field names to users (e.g. `collection_location_adm1` is presented as `adm1`), which will need to be mapped somewhere. I'm thinking the most straightforward way to do this is to use `modify_preform` to just rename the fields. We can presumably jam something into the `Meta` to hold the canonical mapping.
I've gone ahead and assumed that storing the field mapping in the form's `Meta` class is a great idea: https://github.com/SamStudio8/majora/commit/301da1f8545422d0ba58eef68bf8c09fe9d8db21
With much more frustration than planned for, I've wrangled the `BiosourceSamplingProcess` to a stomp-safe implementation. I've added a `map_request_fields` function to the earlier superclass that maps request data to the real model field names according to the `Meta.field_map` of the form instance. We intercept `data` and `initial` before they head up to `BaseForm` via `super` to fix field names, and also intercept the `_errors` in `_post_clean` to spit errors with the user-facing field names back to clients. It's pretty cool, actually.
Honestly this was more work than intended, and I worry we've injected a whole load of Majora-specific magic that is less clear now that the `ModelForm` is taking care of things. Aside from that, the test case shows that the stomp-free implementation works. However, it has also uncovered an edge case where adding an error to the form inside `clean` will raise a `ValueError` if the field has been suppressed from the form (by Django's design). This only happens in cases where we add an error to a field on the form based on another field (e.g. adding a warning to `received_date` if `collection_date` is also missing). The problem here is that the form's `clean` is processed regardless of whether the fields are present or not. I'm sure we'll find out how much of an issue that is, eventually.
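One way to sidestep that `ValueError` would be to only attach cross-field errors to fields that survived the partial filter. A minimal sketch of such a guard (this is not the committed fix, just the shape of the idea):

```python
def add_error_safe(present_fields, errors, field, message):
    """Attach an error only if the target field is still on the (possibly
    partial) form; Django's add_error raises ValueError for fields the
    form doesn't know about."""
    if field in present_fields:
        errors.setdefault(field, []).append(message)
    return errors

errors = {}
add_error_safe(["received_date"], errors, "received_date",
               "provide collection_date or received_date")
add_error_safe(["received_date"], errors, "collection_date", "missing")  # skipped
print(errors)
# {'received_date': ['provide collection_date or received_date']}
```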
- Use `self.partial` in `clean` to absorb errors that don't matter when updating

The groundwork laid for the sampling process meant that the remaining pieces were actually quite straightforward; the `BiosampleSource` and `BiosampleArtifact` are now also stomp-free. https://github.com/SamStudio8/majora/commit/f868d06db9d4f038243531ee5bae35beb0ecabe6 completely removes the `form_handlers.handle_testsample` function; all the heavy lifting is done exclusively by `api_views.add_biosample`. The interface now takes a `partial` attribute that enables my partial updates magic on the superclassed `ModelForm` (disabled by default).
The `test_biosample_add_update_partial_supp_stomp` test case covers all user-reachable fields and appears to show that creating and updating (partial or otherwise) work as expected. There is some remaining weirdness with `None` vs. `''` that I don't entirely understand; I suspect I need to formally remove `null=True` from the underlying models, as that is not good practice.
Shockingly, after avoiding this for so long, we've made excellent progress in just two days. To be fair, mental preparation is about 90% of refactoring. I think it would be wise to try and write some test cases that match the behaviours we're expecting to cover when we deploy this feature for real.
This is part of the battle. Now:

- Update the `update` interface to take advantage of this new behaviour

Crushing this. A draft PR is now over on Majora, but there is still a lot to be done. The improved test suite (https://github.com/SamStudio8/majora/commit/121fbc4929c0f3062ad90cb528c24383163a6175) has picked up a few small issues; I've pinned down the form error that was being thrown and caught it, which means the `collection_date` does not need to be provided with partial updates.
To ease debugging (and because it's just generally useful), I've finally linked the `TatlVerb` reverse generic relations to the `MajoraArtifact` model (https://github.com/SamStudio8/majora/commit/abaecc68f904fdda5dfb0ff26dca1b681da0a9af), so we can print out some useful information about an artifact's update history in Majora:
We just dump the whole `extra_context` JSON for now, but we could add some util functions to limit or munge that data in future. Nothing in there is sensitive.
Given that we want to deprecate the current `update_biosample` interface reserved for PHA, I'm thinking I'll use that endpoint to check the artifact exists, automatically set the new `partial` flag, then internally forward the request to `add_biosample`.
Update: That's actually a little complicated, as each endpoint has an `f` function that is only in the scope of the endpoint. That is, it's quite hard to just say "use the add biosample behaviour" inside the update endpoint. Additionally, the endpoint itself does nothing other than pass its own `f` to `wrap_api_v2`, so merely forwarding the request would also mean the OAuth scopes would be incorrect.
Looks like it is time to learn me some class based views!
https://github.com/SamStudio8/majora/commit/8b037c611f79e303f547a60e5f3e886dafeb8cee removes `api_views.add_biosample` and `api_views.update_biosample` and adds a new `BiosampleArtifactEndpointView` class-based view controller. Going forward I'd like to see all the endpoints migrated to this format, as it is superior Django style. There are mixins for updating and creating bundled with Django, but they are written with the mindset that `GET` requests should show an HTML form (although there are JSON mixins too).
At the moment this class view does nothing other than give us a fancy way to dispatch `wrap_api_v2` with the right `partial` flag and OAuth scopes, but I already have some neat ideas about checking for required fields and taking some gubbins from `wrap_api_v2`. The current implementation is still a little messy because there is a lot of code in `wrap_api_v2` that I don't want to touch right now (`Tatl` request monitoring and OAuth permission checks), so we use the wrapper as before.

We seem to be heading in the direction of implementing something that looks a lot like DRF. I wasn't really happy with the DRF v3 API, as I don't think all the magic was worth it, but if that is indeed where we end up, then so be it.
Note that this means the `partial` flag on the client side is already deprecated. We set it automatically inside the new CBV depending on whether the tail end of the API endpoint name is "add" or "update".
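The dispatch rule can be expressed as a one-liner; the dotted endpoint-name format used here is an assumption for illustration:

```python
def infer_partial(endpoint_name):
    """partial=True when the tail of the endpoint name is "update",
    False when it is "add". Endpoint naming is assumed,
    e.g. "api.artifact.biosample.update"."""
    return endpoint_name.rstrip("/").rsplit(".", 1)[-1] == "update"

print(infer_partial("api.artifact.biosample.update"))  # True
print(infer_partial("api.artifact.biosample.add"))     # False
```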
Now to Ocarina:

- Added a `--partial` flag to the `put` parser. Adding a new action command would involve more work; this allows us to use all the existing `put` parsing.
- Fields are no longer required when `--partial` is set: https://github.com/SamStudio8/ocarina/commit/c96348dc13e095f50dd90e875f1097063bbf260e changes the required setting to use a function that checks whether `--partial` is in `sys.argv`. A bit gross, but it works.
- Suppress `null` fields if they were not set on the CLI: https://github.com/SamStudio8/ocarina/commit/78df1291a93dacd1eca84946c008d54ad7c2d3b6 now suppresses `null` from being sent to the API by dropping any keys set to `None`, unless the user sets them to the string `_null_` on the CLI instead.

That last change will have the most damage if left unchecked. Possible issues that arise from this change:
- Users must use `_null_` or risk not nulling things that are meant to be overwritten with null (this is low risk, as most users don't know they can do this in the first place)
- Keys dropped from `json_data`. This is also low risk, as the new `ModelForm` stuff should handle it, but could cause trouble if we strip out a `mode` field or something by mistake. Further testing required.
- Empty `metadata` and `metrics` keys will be sent as `{}`. This shouldn't cause those structs to be destroyed on the Majora side, but I will double check the test suite.

I think we're ready to proceed with integrating this patch. Plan is:
- ~Fix the `None` problem that has arisen from the new endpoint (probably need to remove `nullable=True` from model fields as per best practice)~ this is not a blocking issue
- Check `metadata` and `metric` keys (am 80% sure the tests cover this case already)
- Add `nulled_fields` to `extra_context` in `TatlVerb`: https://github.com/SamStudio8/majora/commit/12473e49ffeabcdceede3eec2b1d8b2ae6f678aa
- `metadata` and `metrics`: https://github.com/SamStudio8/majora/commit/c2c148b12db62fc09e8cede2a23cf423dee2221f, https://github.com/SamStudio8/majora/commit/c38bae55c0d609e7bae770d389c36c6ff8dd374b
- Bump `Ocarina#d10` to `0.37.0`
- Warn `#metadata` about the change and advise sites to be on the lookout for anything spurious, just in case (2021-03-04 AM)
- Deploy `Majora#d10` to production (2021-03-05 AM)
- Check `root_biosample_source_id` with Ocarina (2021-03-05 PM)
- `collection_pillar`: may need to give guidance if WSI are still using their own client
- Document `--partial` and warn that `_null_` must be provided to Ocarina (not Majora) to handle `null`
I am satisfied with the uploader test conducted with AU earlier today. Tomorrow morning I will integrate the d10-stomping branch to stable and deploy to production. We can conduct a live weapons test and warn the metadata channel shortly afterwards.
I've bumped the minimum Ocarina version to `0.37.0`, so figured it would be best to give a day of notice of the change. I've warned `#metadata` that this patch will be applied tomorrow morning. I'll get in touch with FS to arrange a time to conduct tests.
Deploying `d10-stomping` to prod had the nasty side effect of destroying performance loading the `detail_artifact` view. My gut feeling was this must be the new audit log I added at the bottom. This worked fine on test, which has a much smaller database. I've patched in a `db_index` on `tatl.TatlVerb.content_id`, which seems to have done the trick (https://github.com/SamStudio8/majora/commit/73b76fa3285550cf895977e6a6c3bf3e750031fb).
I've notified `#metadata-apis` of the change, and asked AU to loop me in on any user reports arising today. FS has had trouble with their script this morning, so this might raise an urgent opportunity to test the new interface for real.
Noticed some HTTP 500s coming through shortly after the patch. In my focus on the biosample endpoints, I didn't try adding a library artifact during testing. The addition of the `partial` kwarg to `f` needed to be applied to every `f` across all endpoints, which I hadn't done -- raising an unexpected keyword argument error. Awks. I've patched this (https://github.com/SamStudio8/majora/commit/9bc817be6e1c173e0ff8dc216c030a081fb04b89). The affected user has been contacted and the affected data has been re-uploaded.
Successful test with FS, this will be rolled out to the PHE backfilling scripts.
Downgrading severity as I've implemented the urgent components and people seem happy! :tada:
Sadly, given the complexities involved in handling all the diverse ways data is uploaded through us, an edge case has slipped through the net. A single WSI "empty" sample was partially updated by a PHA through the shiny new interface earlier today.

The new `partial` endpoint enables validation to be skipped on fields not included in the submission, which is normally fine -- except in the special case of a sample that didn't have all the mandatory fields to begin with. The sample had enough metadata to be processed by Elan (which only checks for a magic flag), despite not actually having the mandatory information in place. This has caused early termination of Asklepian, as a sample with neither a received nor a collection date has gummed up the processing of best references.
To mitigate: in `ocarina_resolve`, we distribute the metadata table as produced by Ocarina, so we need to nuke these bad rows there. (Future: the v3+ API should just not send bad rows to clients at all.)

Data fix:
I've pruned the sequence from tonight's Asklepian best_ref list and MSA, so it should be able to continue where it left off.
Although the test cases covered a wide variety of sample states, I had forgotten to include an example that specifically called `addempty`. I was so focused on ensuring the regular uploader behaviour didn't change, I'd forgotten about the subtleties of `addempty`, because it's such a curious interface (in particular, the way `created.submitter_user` is left blank as a flag). Going forward we may want a `MajoraArtifact.incomplete` flag to raise when empty samples are pushed, and unset it when the model is successfully saved for the first time (after force creation), but I'm too tired to think about that right now.
There appear to be a few additional samples with a similar fate, but for some reason they were filtered out from the metadata TSV earlier in the process. These also have PAGs in Majora which will need removing, but do not appear to be troubling Asklepian. They will however need to be fixed before midnight Monday, as they will probably trash the outbound pipelines.
This has been interesting to unpick. My working theory is that these additional samples that are missing from the metadata TSV were uploaded during a short window today, between the Elan `resolve_uploads` step failing and me manually triggering a refresh of the `latest.tsv` and calling `resolve_uploads` again.

`save_manifest` is the first task Elan runs and is responsible for copying the `latest.tsv` manifest ("the metadata table") somewhere safe. I assume what has happened here is that that step ran successfully and its result was cached. This would explain why samples processed by Elan are missing from the downstream metadata table -- the cached result would not have been the same metadata table as what was read by `resolve_uploads`.
This may seem surprising, as you might have anticipated Nextflow to determine the input file was updated and re-run the step. However, the file is provided as a parameter instead. I've confirmed this theory with `nextflow log` to find the `workdir`, and indeed I can date the `majora.metadata.tsv` manifest to 16:01: the same minute Elan started today.

These other samples are therefore in the dataset and will also need removing tomorrow. However, as they were missing from the cached manifest, they've not gone on to cause the same kind of trouble as the single sample that made it through (as the manifest controls `reconcile_downstream`, which builds the FASTA and matched metadata table).
Mitigation: pin `resolve_uploads` to use the output manifest from `save_manifest` (https://github.com/SamStudio8/elan-nextflow/commit/67ae19f975ba809cfa2e89d4ddd38e1b907375c3). We might consider setting `cache false` on `save_manifest`, but we need to be more careful first -- anything that updates `latest.tsv` can damage a restarted Elan pipe.

Data fix:
```python
>>> bad_eggs = models.BiosampleArtifact.objects.filter(created__biosourcesamplingprocess__submission_user__username="███████████", created__biosourcesamplingprocess__collection_date__isnull=True, created__biosourcesamplingprocess__received_date__isnull=True)
>>> for egg in bad_eggs:
...     try:
...         pag = models.PublishedArtifactGroup.objects.get(tagged_artifacts__in=[egg])
...     except models.PublishedArtifactGroup.DoesNotExist:
...         continue
...     for dra in pag.tagged_artifacts.filter(digitalresourceartifact__id__isnull=False):
...         dra.delete()
...     pag.quality_groups.all().delete()
...     pag.delete()
```
Additionally, a handful of good PAGs were caught up in the same issue today. They should be correctly processed tomorrow when `reconcile_downstream` rebuilds the merged consensus.
Save for yesterday's explosion (which I have now written up as an experiment in new ways to report things), this has gone very well. I'm opening a follow-up issue on Majora's repo to finish moving the non-biosample interfaces to `partial`-friendly methods, as they aren't a DIPI priority now.
With PHA now looking to backfill data, the pressure is finally high enough to pick up the support for PATCH requests to the API to update sample fields. See https://github.com/SamStudio8/majora/issues/32.