duncanpeacock / fragalysis-backend

Django backend loader and Restful API for fragalysis server
Apache License 2.0

WP4 - API to upload target dataset - followup actions #21

Open duncanpeacock opened 3 years ago

duncanpeacock commented 3 years ago

This is a placeholder for follow-up actions to implement the data loader API now that the first version (Minimum Viable Product) has been merged.

  1. Implementation actions - covalent etc. functionality should be added to the data validation/creation repo.
  2. Any changes required for the fragalysis loader? (See version 1)
  3. Final status of authentication/authorisation (addition of owner-id??)
  4. Do we need a status field for the target during upload, so that the React frontend "knows" that the target is being updated, or that the target has been temporarily removed from the list of targets?
duncanpeacock commented 3 years ago

Initial requests for changes from discussions on the upload target set functionality:

  1. The upload screen should make it clear to the user that they can close the browser window and the upload will continue.
  2. It should also be clear that they can save the link and come back to check on progress.
  3. Add functionality to send an email when upload/validation is complete, with any relevant details.
duncanpeacock commented 3 years ago

When running the data loader to load new targets, the inspirations seem to disappear. This is likely because references to primary keys are regenerated during the upload. Change the new loader API so that the current references for computed sets linked to a target are saved and then restored when the target is uploaded.
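A minimal sketch of the save-and-restore idea, assuming the references can be keyed by a stable code rather than the regenerated primary key (the function names and mappings here are hypothetical, not the actual fragalysis models):

```python
def save_refs(linked_pks, pk_to_code):
    """Before the reload: convert pk-based links to stable codes."""
    return [pk_to_code[pk] for pk in linked_pks]

def restore_refs(saved_codes, code_to_new_pk):
    """After the reload: map the stable codes back onto the regenerated pks."""
    return [code_to_new_pk[code] for code in saved_codes]

# Example: pks 1 and 2 are regenerated as 11 and 12 during the upload.
saved = save_refs([1, 2], {1: 'Mpro-x0072A', 2: 'Mpro-x1101A'})
restored = restore_refs(saved, {'Mpro-x0072A': 11, 'Mpro-x1101A': 12})
# restored == [11, 12]: the links survive even though the pks changed
```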

duncanpeacock commented 3 years ago

Rachael has successfully tested the load. There is some remaining work and this can be tracked under the follow-up actions.

Agreed roadmap for the remaining target upload tasks following meeting on 17/12/2020:

The remaining follow-up tasks on the loader to be tracked by this issue are as follows:

  1. The upload screen should make it clear to the user that they can close the browser window and the upload will continue and that they can save the link and come back and look.
  2. Add functionality to send an email when upload/validation is complete with any relevant details.
  3. Fix the compound set problem (see below).

Problem

When a target set is reloaded, links to existing compound sets are broken.

The initial diagnosis is as follows:

Fragalysis backend repo:

In tasks.process_design_compound:

The inspirations field in the compound model is a many-to-many field linking to the molecules model/table.
The likely cause is that uploading a new target wipes out these links, so the compound sets are no longer visible.

Likely solution:

When a target set is uploaded, examine where molecules are removed/added and make sure that the many-to-many field is retained.

Place to start looking: target_set_upload.analyse_mols

```python
for mol_id in ids:
    if mol_id not in [a['id'] for a in mol_group.mol_id.values()]:
        print(mol_id)
        this_mol = Molecule.objects.get(id=mol_id)
        mol_group.mol_id.add(this_mol)
```

Is the many-to-many field correct after the reload? Otherwise it needs to be saved and restored.

duncanpeacock commented 3 years ago

Analysis

When the proteins are loaded, existing proteins with alternate names are actually deleted and recreated rather than updated. This changes the id and breaks links. This has been confirmed by running the Mpro upload multiple times. The number of proteins stays the same, but the auto-incremented id increases each time by 295.
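The delete-and-recreate behaviour can be illustrated with a toy auto-increment table (a simulation for illustration, not the actual Django models): after a reload the row count is unchanged, but every id has moved on, so any stored reference to an old id now dangles.

```python
class ToyTable:
    """Stand-in for a database table with an auto-increment primary key."""

    def __init__(self):
        self.next_id = 1
        self.rows = {}                      # id -> code

    def create(self, code):
        row_id, self.next_id = self.next_id, self.next_id + 1
        self.rows[row_id] = code
        return row_id

    def delete_all(self):
        self.rows.clear()                   # auto-increment ids are not reused

proteins = ToyTable()
first_ids = [proteins.create(c) for c in ('Mpro-x0072A', 'Mpro-x1101A')]
proteins.delete_all()                       # reload: delete then recreate
second_ids = [proteins.create(c) for c in ('Mpro-x0072A', 'Mpro-x1101A')]
# Same codes, same row count, but no id survives the reload, so any
# foreign key that stored an old id now points at nothing.
```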

The problem is caused by the update to Protein.code that is made when the Protein has an alternate name.

The processing is as follows (all in target_set_upload.py - but also existing in the current loader):

  1. Protein.code initially comes from the directory in the aligned folder – e.g. Mpro-x1101_0A
  2. New proteins with these codes are written in the function add_prot
  3. Proteins for the target where the code is not in the list of folders are removed (remove_not_added function).
  4. If the aligned folder also contains a metadata.csv file, this is processed and any alternate names are written to the alternate_names.csv file.
  5. Then later on, in the function rename_mol, Protein.code is modified as follows: `new_name = str(mol_target).replace('_0', '') + ':' + str(alternate_name).strip()`

This produces different results depending on whether the folder name has "_0" in it or not, e.g.:

Mpro-x0072A_0A becomes: Mpro-x0072A:AAR-POS-d2a4d1df-1
Mpro-x1101_0A becomes: Mpro-x1101A:AAR-POS-0daf6b7e-40
Mpro-x1101_1A becomes: Mpro-x1101_1A:AAR-POS-0daf6b7e-40

  6. When the data is loaded again, the codes can't be found, because they have been modified.
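The stripping behaviour can be reproduced with a one-line sketch of the rename rule quoted above (the function name is mine; the expression is the one from rename_mol):

```python
def rename_code_old(mol_target, alternate_name):
    # Old rule from rename_mol: strip the first '_0' from the folder
    # name before appending the alternate name.
    return str(mol_target).replace('_0', '') + ':' + str(alternate_name).strip()

# 'Mpro-x1101_0A' loses its '_0' suffix; 'Mpro-x1101_1A' is left intact,
# so the resulting codes are inconsistent across folders.
```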

Our first attempt at a fix failed because we tried to use just the part up to the colon. That only works where the whole folder name is in the key, not for the codes where the '_0' has been stripped off, which is the normal situation.

Solution:

One possible solution is to:

But at the moment the remove_not_added function would fail. I can probably fix this by doing the same thing in the remove_not_added function (or by keeping a list of the keys I've matched and removing all the others).

Questions:

duncanpeacock commented 3 years ago

Discussed with Frank: the decision is to remove the code that replaces '_0'. The code will always be folder:alternate_name, as in the last example: Mpro-x1101_1A becomes Mpro-x1101_1A:AAR-POS-0daf6b7e-40
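Under the agreed decision, the rename sketch reduces to the following (again, the function name is mine, used only to illustrate the rule):

```python
def rename_code(mol_target, alternate_name):
    # New rule: never strip '_0'; the code is always folder:alternate_name.
    return str(mol_target) + ':' + str(alternate_name).strip()
```

Because the original folder name is preserved verbatim, reloading the same target reproduces the same codes, so existing links can be matched again.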

The current loader also needs to be fixed. Will raise/fix an issue on the fragalysis-loader repo.

duncanpeacock commented 3 years ago

The data upload problem is solved as per my previous message. Currently working on the other changes (screen comms/email notification).

duncanpeacock commented 3 years ago

I also needed to change the compound set uploader so that when it checks protein.code, it matches up to ":" rather than "_", so that it complies with the new names.
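A sketch of that matching change, assuming the uploader compares codes by prefix (the helper name is hypothetical):

```python
def protein_code_prefix(code):
    # Compare on everything before the first ':' (previously '_'),
    # consistent with the new folder:alternate_name codes.
    return code.split(':', 1)[0]

# A code with an alternate name and a plain folder code now compare
# on the same full folder-name prefix.
```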