Open duncanpeacock opened 3 years ago
Two initial requests for changes from discussions on the upload target set functionality:
When running the data loader to load new targets, the inspirations seem to disappear. This is likely because references to primary keys are regenerated in the upload. Change the new loader API so that the current references for computed sets linked to a target are saved before the upload and restored once the target has been uploaded.
Rachael has successfully tested the load. There is some remaining work and this can be tracked under the follow-up actions.
Agreed roadmap for the remaining target upload tasks following meeting on 17/12/2020:
The remaining follow-up tasks on the loader to be tracked by this task are as follows:
When a target set is reloaded, links to existing compound sets are broken
Initial prognosis is as follows:
Fragalysis backend repo:
In tasks.process_design_compound:
The inspirations field in the compound model is a many-to-many field linking to the molecules model/table.
The likely cause is that uploading a new target wipes out these links, so the compound sets aren't visible.
Likely solution:
When a target set is uploaded, examine where molecules are removed/added and make sure that the many-to-many links are retained.
Place to start looking: target_set_upload.analyse_mols
    for mol_id in ids:
        if mol_id not in [a['id'] for a in mol_group.mol_id.values()]:
            print(mol_id)
            this_mol = Molecule.objects.get(id=mol_id)
            mol_group.mol_id.add(this_mol)
Is the many-to-many field correct after the reload? If not, it needs to be saved before the upload and restored afterwards.
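A minimal sketch of the save-and-restore idea, written in plain Python rather than against the Django ORM so that it runs stand-alone. The helper names `save_inspiration_links` and `restore_inspiration_links`, and the choice of the protein code as the stable key, are assumptions for illustration and not the loader's actual API.

```python
def save_inspiration_links(compounds, code_by_mol_id):
    """Record each compound's inspirations by a stable code, not by primary key."""
    return {
        name: [code_by_mol_id[mol_id] for mol_id in mol_ids]
        for name, mol_ids in compounds.items()
    }


def restore_inspiration_links(saved, new_id_by_code):
    """Re-point the saved links at the regenerated primary keys.

    Codes that no longer exist after the reload are skipped.
    """
    return {
        name: [new_id_by_code[code] for code in codes if code in new_id_by_code]
        for name, codes in saved.items()
    }


# Before the reload: compound -> molecule primary keys, and primary key -> code.
# (Illustrative values only.)
compounds = {"cmpd-1": [1, 2]}
code_by_mol_id = {1: "mol-a", 2: "mol-b"}
saved = save_inspiration_links(compounds, code_by_mol_id)

# After the reload the same molecules exist under new primary keys.
new_id_by_code = {"mol-a": 296, "mol-b": 297}
restored = restore_inspiration_links(saved, new_id_by_code)
```

The point of the design is that the saved links never hold a primary key, so they survive any delete-and-recreate that happens during the upload.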
When the proteins are loaded, existing proteins with alternate names are actually deleted and recreated rather than updated. This changes the id and breaks links. This has been confirmed by running the Mpro upload multiple times. The number of proteins stays the same, but the auto-incremented id increases each time by 295.
The problem is caused by the update to Protein.code that is made when the Protein has an alternate name.
The processing is as follows (all in target_set_upload.py - but also existing in the current loader):
This produces different results depending on whether the folder name has "_0" in it or not:
e.g.
Mpro-x0072A_0A becomes: Mpro-x0072A:AAR-POS-d2a4d1df-1
Mpro-x1101_0A becomes: Mpro-x1101A:AAR-POS-0daf6b7e-40
Mpro-x1101_1A becomes: Mpro-x1101_1A:AAR-POS-0daf6b7e-40
Our first attempt at a fix failed because we tried to match on just the part of the code up to the colon. That only works when the whole folder name is in the key, not when the '_0' has been stripped off, which is the normal situation.
One possible solution is to:
But at the moment the remove_not_added function would fail. I can probably fix this by doing the same thing in the remove_not_added function (or by keeping a list of the keys I've matched and getting rid of all the others).
Discussed with Frank: the decision is to remove the logic that replaces '_0'. Protein.code will always be original folder:alternate_name, as in the last example: Mpro-x1101_1A becomes Mpro-x1101_1A:AAR-POS-0daf6b7e-40
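The agreed rule could be sketched as below. `make_protein_code` is a hypothetical helper, not a function in target_set_upload.py, and returning the bare folder name when there is no alternate name is an assumption.

```python
def make_protein_code(folder_name, alternate_name=None):
    """Build Protein.code as 'original folder:alternate_name'.

    The folder name is used unchanged: no '_0' replacement is applied.
    """
    if alternate_name:
        return f"{folder_name}:{alternate_name}"
    # Assumption: without an alternate name the code is just the folder name.
    return folder_name


code = make_protein_code("Mpro-x1101_1A", "AAR-POS-0daf6b7e-40")
```

Because the folder name is kept whole, the same folder always yields the same code on every upload, so existing proteins can be matched and updated instead of deleted and recreated.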
The current loader also needs to be fixed. Will raise/fix an issue on the fragalysis-loader repo.
The data upload problem is solved as per my previous message. Currently working on the other changes (screen comms/email notification).
I also needed to change the compound set uploader so that when it checks protein.code, it matches on the part up to ":" rather than up to "_", so that it complies with the new names.
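For illustration, the comparison on the part before the ":" might look like this. `code_prefix` and `matches_folder` are hypothetical names, not the uploader's actual functions.

```python
def code_prefix(protein_code):
    """Return the part of a protein code before the first ':'.

    A code with no ':' (no alternate name) is returned unchanged.
    """
    return protein_code.split(":", 1)[0]


def matches_folder(protein_code, folder_name):
    """True if the code was built from the given folder name."""
    return code_prefix(protein_code) == folder_name


prefix = code_prefix("Mpro-x1101_1A:AAR-POS-0daf6b7e-40")
```

Splitting on "_" instead would cut "Mpro-x1101_1A" down to "Mpro-x1101", which can no longer distinguish the "_0A" folder from the "_1A" folder.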
This is a placeholder for follow-up actions to implement the data loader API now that the first version (Minimum Viable Product) has been merged.