biocore / microsetta-private-api

A private microservice to support The Microsetta Initiative
BSD 3-Clause "New" or "Revised" License
6 stars 19 forks source link

Survey database update v2 #489

Closed charles-cowart closed 1 year ago

charles-cowart commented 1 year ago

Alternative approach to survey/data migration that preserves retired survey templates/groups. Working on final four broken unittests and adding additional tests as appropriate.

charles-cowart commented 1 year ago

@cassidysymons would you still prefer this PR to point to master-overhaul instead of master? I'm reviewing old comments from the v1 pr.

cassidysymons commented 1 year ago

@charles-cowart Yes, please point it at master-overhaul. Thanks!

cassidysymons commented 1 year ago

@charles-cowart Why are you still migrating the old survey responses into new templates? Didn't the idea of retiring templates eliminate the need for that migration?

charles-cowart commented 1 year ago

@cassidysymons The idea of retiring templates was to allow us to migrate data without having to perform a lot of CRUD operations over the legacy survey data. Otherwise we'd need to delete legacy template information and migrate all legacy responses over to new templates. This is a large operation over valuable data and can't be undone, as Daniel pointed out.

The update code is relatively straightforward to audit, reuses survey create and delete functions, and has unit tests. This should help to inspire confidence in its robustness. Creating new surveys allows us to migrate and test non-destructively with respect to the legacy data.

get_survey_metadata() doesn't return survey_ids along with the survey metadata it returns, so migrating it on the fly seem s possible, but it would break the service if a user or developer (now or in the future) expects a BASIC_INFO template to exist and it doesn't.

wasade commented 1 year ago

I'm also surprised by the need for migration. The ag.survey_answers table does not care about what template a survey_question_id is associated with. I thought the plan was to update the associations with ag.survey_question and ag.group_questions, and to not touch ag.survey_answers

cassidysymons commented 1 year ago

@charles-cowart When we discussed the strategy of retiring templates, we discussed adding survey_template_id to ag_login_surveys, so that the only migration necessary would be adding the template_id to that table, rather than creating new records for the old responses. Did something come up that made that unfeasible?

@wasade Based on the comments for this migration that Dan Hakim wrote, we do need to be able to determine the survey_template_id from a record in ag_login_surveys. However, retroactively adding it via a migration seems like the least invasive option at our disposal.

wasade commented 1 year ago

I thought we were making survey_template_id explicit in ag_login_surveys in this PR. Maybe we're mincing words, but I wouldn't consider that a migration as we aren't moving data but creating it.

cassidysymons commented 1 year ago

That was my understanding too, so I think we only have a semantics difference. I think the change would need to use the migration_support.py functionality as it would be incredibly complex to achieve entirely via sql, but yes, it's creating data, not moving it.

charles-cowart commented 1 year ago

I'm also surprised by the need for migration. The ag.survey_answers table does not care about what template a survey_question_id is associated with. I thought the plan was to update the associations with ag.survey_question and ag.group_questions, and to not touch ag.survey_answers

@wasade That's not actually true. survey_answers binds a set of question/response pairs to a survey_id. The template_id is not a column in the table, but a template_id is associated with the survey_id. You have to realize that we fill the unanswered questions in a survey with 'Undetermined' and add those to survey_answers as well, making a fixed set of question/answer pairs bound to a survey_id. The survey_id is the sticking point because everywhere elsewhere in the code a survey_id is assumed to be associated with one and only one template_id forever. It's not something you can really break.

charles-cowart commented 1 year ago

@charles-cowart When we discussed the strategy of retiring templates, we discussed adding survey_template_id to ag_login_surveys, so that the only migration necessary would be adding the template_id to that table, rather than creating new records for the old responses. Did something come up that made that unfeasible?

My understanding was that we'd be able to create the new surveys on-demand if we called this new functionality in get_survey_metadata(), leaving the legacy data alone. Or we could call it once for every barcode in migration. The advantages of the latter is we process a barcode once and only once, without having to add a column or some other metadata to denote that we've migrated a barcode and don't do it again on-demand. However we always needed to maintain the constraint that a survey_id is associated with one and only one template_id permanently. It breaks the code if we don't, and it makes the data inconsistent in survey_answers.

charles-cowart commented 1 year ago

That was my understanding too, so I think we only have a semantics difference. I think the change would need to use the migration_support.py functionality as it would be incredibly complex to achieve entirely via sql, but yes, it's creating data, not moving it.

The population of the new survey_template_id column in ag_login_surveys is actually done in the 0108.sql file. I generate three temp tables at the top of the file and use the successive results to populate the column before deleting them.

cassidysymons commented 1 year ago

If we add the survey_template_id to ag_login_surveys, and leave all of the old survey templates intact (but retired), why do we need to do any sort of migration?

charles-cowart commented 1 year ago

Confirmed the lone error in tests is caused by comparing the observed results of _migrate_responses() with the expected values. The results are identical, but self.assertDictEqual(obs, filled_surveys) fails because elements are in different internal order. Will look into and resolve after the call.

charles-cowart commented 1 year ago

If we add the survey_template_id to ag_login_surveys, and leave all of the old survey templates intact (but retired), why do we need to do any sort of migration?

We aren't migrating data, but we do need to copy it. For survey x of template 'Primary Questionnaire', we need to take those 100 odd questions, filter out the 'Undetermined' non-answers, create new templates with the right subset of questions in each, padding out those templates with their own 'Undetermined' answers for the questions that were not answered, and push them into the system with survey X's timestamp, source id, sample_id, account_id, etc.

The function that does that isn't migrating data per se, and doesn't have to be called in migration.py, but it's a convenient place to do so.

charles-cowart commented 1 year ago

Closed #466 as this PR supersedes it. Specifically, if we remove the component that migrates old surveys into new templates, what is left is code that’s basically a better version of the code written in #466 to populate surveys with legacy answers, with better tests. Since they both touch read_survey_template() in the same way, it’s really one or the other. The only thing left to keep from #466 was the percentage completed code, so I folded it in. I also added additional tests for the Qiita-related functionality as requested in #466.

charles-cowart commented 1 year ago

@cassidysymons This update should have everything but the PM_ to EBI migration and the smallest delta time fix. The privacy test should fail because we changed PM_GENDER_v2 back to normal and I haven't addressed that test yet. I suspect that one test using HumanInfo will fail too, just because I changed the constructor to make it work locally.

charles-cowart commented 1 year ago

@cassidysymons @wasade ostensibly finished! All comments addressed.

wasade commented 1 year ago

Thanks, reviewing in a moment

charles-cowart commented 1 year ago

Thanks Charlie. So I don't see where barcodes are correctly associated with their most temporally similar survey...

Hi @wasade Thanks for the review! I have a method called _get_timestamp() that looks for a survey that is an exact match to a barcode and a survey template id. If it finds one, it'll return that to the code I've highlighted below. If for some reason it can't find a matching survey, it'll assume every source has a creation timestamp and return that instead. It's in a separate method so we could easily tweak it based on your expected feedback.

The code below will take the set of all responses across all times and return them to the caller. (Not sure if that's what we really wanted because some answers change over time and some don't.) Where there are multiple responses to the same question, it will select the response closest to the timestamp returned by _get_timestamp():

https://github.com/charles-cowart/microsetta-private-api/blob/eda55696ea443ff5edd21dcd6e4a93260f99536e/microsetta_private_api/repo/survey_template_repo.py#L1141-L1154.

wasade commented 1 year ago

Thanks, @charles-cowart. _get_timestamp does not interact with ag_kit_barcodes as far as I can tell, so it is not obtaining the sample's timestamp

charles-cowart commented 1 year ago

I resolved the merge conflict with the idea that I didn't change anything substantial to the vioscreen-related method and hence the changes in master-overhaul superseded them.

charles-cowart commented 1 year ago

Reviewing and resolving test errors post merge.

charles-cowart commented 1 year ago

@cassidysymons @wasade Hi guys - I think responding to those two open questions makes this tentatively resolved again. All other comments have been resolved. Please let me know what you think. Best, C

cassidysymons commented 1 year ago

@wasade When you have a moment, can you please give this a look and see if there are any unresolved issues or further changes you think would be valuable?

wasade commented 1 year ago

@cassidysymons I recommend squash and merge given the number of commits