COG-UK / dipi-group

Data integrity and pipeline integration working group

Unlinked libraries in Majora #193

Closed SamStudio8 closed 2 years ago

SamStudio8 commented 2 years ago

Sanger have alerted us to several runs uploaded to Majora that do not link multiple libraries to the same run. As a result, the get_sequencing step, which expands runs to libraries and libraries to biosamples, leaves out a number of artifacts. DJ suggested this recent Majora change is the likely culprit, and I believe that is the case too.

This was my poorly thought-out change to fix a race condition with private provider data. Moving the DNASequencingProcessRecord to only be created when the process is new has inadvertently caused a 1:1 map between libraries and runs. I am scoping a more appropriate change to roll back this patch while also preventing a return of the Foel race condition.
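To illustrate how the 1:1 mapping arises, here is a minimal in-memory sketch. It is not Majora's code: `link_libraries`, the `only_when_run_is_new` flag, and the run and library names are all hypothetical stand-ins for the behaviour described above.

```python
def link_libraries(uploads, only_when_run_is_new):
    """Return the (run, library) links created for a sequence of uploads.

    Hypothetical stand-in for the record creation step: with
    only_when_run_is_new=True a link is made only the first time a run is
    seen, which mimics the regression described in this issue.
    """
    runs = set()
    links = []
    for run_name, library_name in uploads:
        run_is_new = run_name not in runs
        runs.add(run_name)
        if only_when_run_is_new:
            # Regression: later libraries on an existing run are never linked.
            if run_is_new:
                links.append((run_name, library_name))
        elif (run_name, library_name) not in links:
            # Intended behaviour: one link per (run, library) pair.
            links.append((run_name, library_name))
    return links


uploads = [("RUN-1", "LIB-A"), ("RUN-1", "LIB-B"), ("RUN-2", "LIB-C")]
buggy_links = link_libraries(uploads, only_when_run_is_new=True)
fixed_links = link_libraries(uploads, only_when_run_is_new=False)
print(buggy_links)  # LIB-B is missing from RUN-1
print(fixed_links)
```

In the buggy variant each run ends up with exactly one library, so any downstream step that walks from runs to libraries never sees the rest.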

SamStudio8 commented 2 years ago

>>> import datetime
>>> from django.db.models import Q
>>> libraries_on_run = models.DNASequencingProcessRecord.objects.filter(in_artifact__isnull=False).values("in_artifact__id")
>>> count = 0
>>> for lib in models.LibraryArtifact.objects.filter(~Q(id__in=libraries_on_run)):
...   for tatl_verb in lib.tatl_history:
...     if tatl_verb.verb == "CREATE" and tatl_verb.request.timestamp.date() > datetime.date(2022,2,16):
...       count += 1
...       break
...
>>> count
104

104 libraries affected

SamStudio8 commented 2 years ago

103 libraries are from Sanger, one was via the Foel ingest

SamStudio8 commented 2 years ago

Addressed by https://github.com/SamStudio8/majora2/commit/0507c0d0a75d4bea6cae224ecb99227c3d69b50a, tested on Magenta, merging to prod Majora

SamStudio8 commented 2 years ago

Applying majora2.0150_majoraartifactprocessrecord_unique_name... OK

SamStudio8 commented 2 years ago

I've used Foel to replay the single non-Sanger affected library and the library is now correctly linked through to a sequencing run. I've advised DJ on next steps and negotiated that we divide up the missing libraries over today and tomorrow just to avoid a cripplingly large Elan run on the weekend.

SamStudio8 commented 2 years ago

DJ asked whether it would be OK to reupload the affected runs entirely (complete with the missing and not-missing libraries). I've realised this will have the side effect of creating links again for anyone uploading old data, so I'll put together a migration to backfill the unique_name for DNASequencingProcessRecord.
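A rough sketch of why the backfill matters, assuming the fix skips creating a record when one with the same unique_name already exists. That assumption, the `replay` and `backfill` helpers, and the dict-based records are hypothetical; only the "%s-%s" run-library name format comes from this thread.

```python
def replay(records, run_name, library_name):
    """Link a library to a run unless a record with the same unique_name exists."""
    unique_name = "%s-%s" % (run_name, library_name)
    if any(r["unique_name"] == unique_name for r in records):
        return False  # already linked, nothing to do
    records.append({"run": run_name, "library": library_name, "unique_name": unique_name})
    return True


def backfill(records):
    """Give older records (created before unique_name existed) a unique_name."""
    for r in records:
        if r["unique_name"] is None:
            r["unique_name"] = "%s-%s" % (r["run"], r["library"])


# A record created before the unique_name field was introduced.
legacy = [{"run": "RUN-1", "library": "LIB-A", "unique_name": None}]

# Replaying old data without a backfill links the library a second time.
without_backfill = [dict(r) for r in legacy]
replay(without_backfill, "RUN-1", "LIB-A")

# After the backfill the replay recognises the existing link.
with_backfill = [dict(r) for r in legacy]
backfill(with_backfill)
replay(with_backfill, "RUN-1", "LIB-A")

print(len(without_backfill), len(with_backfill))
```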

SamStudio8 commented 2 years ago

Needed to extend the unique_name field to fit the oddly long private run names once a library name is appended. I should have just used hashlib to make a hex string, but whatever; it might be useful to have these human-readable in future.
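For reference, the hashlib alternative mentioned above might look like the sketch below. It was not what was deployed; `hashed_unique_name` and the example names are hypothetical, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib


def hashed_unique_name(run_name, library_name):
    """Return a fixed-length hex name for a (run, library) pair."""
    raw = "%s-%s" % (run_name, library_name)
    # A SHA-256 hex digest is always 64 characters, however long the inputs are.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


name = hashed_unique_name("RUN-1", "LIB-A")
long_name = hashed_unique_name("an-oddly-long-private-run-name" * 10, "LIB-A")
print(len(name), len(long_name))
```

The trade-off is the one noted above: the field length becomes predictable, but the value is no longer human-readable.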

SamStudio8 commented 2 years ago

def assign_uniquename(apps, schema_editor):
    # Use the historical models so the migration matches the schema at this point.
    DNASequencingProcessRecord = apps.get_model("majora2", "DNASequencingProcessRecord")
    DNASequencingProcess = apps.get_model("majora2", "DNASequencingProcess")
    for record in DNASequencingProcessRecord.objects.all():
        # Look up the record's process as a DNASequencingProcess to read its run_name.
        run = DNASequencingProcess.objects.get(id=record.process.id)
        run_name = run.run_name
        library_name = record.in_artifact.dice_name
        # Backfill the name as "<run_name>-<library_name>".
        record.unique_name = "%s-%s" % (run_name, library_name)
        record.save()

Applying majora2.0152_backfill_dnasequencingprocessrecord_uniquename... OK

SamStudio8 commented 2 years ago

DJ has replayed one run and we are both satisfied with our inspections. We'll replay the oldest affected runs and DJ tells me an automated housekeeping process will pick up some of the most recent ones. We can fill in the gap in the middle tomorrow.

SamStudio8 commented 2 years ago

Down to 86 libraries now

>>> libraries_on_run = models.DNASequencingProcessRecord.objects.filter(in_artifact__isnull=False).values("in_artifact__id")
>>> models.LibraryArtifact.objects.filter(~Q(id__in=libraries_on_run)).count()
86

SamStudio8 commented 2 years ago

Elan processed another 15k or so this morning. DJ has replayed the remaining libraries to Majora for linking. There are 0 libraries now outstanding. Data integrity should be restored by a successful run of Elan tomorrow.

SamStudio8 commented 2 years ago

Finished processing the backlog today -- integrity restored