Closed by SamStudio8 2 years ago
>>> libraries_on_run = models.DNASequencingProcessRecord.objects.filter(in_artifact__isnull=False).values("in_artifact__id")
>>> count = 0
>>> for lib in models.LibraryArtifact.objects.filter(~Q(id__in=libraries_on_run)):
...     for tatl_verb in lib.tatl_history:
...         if tatl_verb.verb == "CREATE" and tatl_verb.request.timestamp.date() > datetime.date(2022,2,16):
...             count += 1
...             break
...
>>> count
104
104 libraries affected
103 libraries are from Sanger, one was via the Foel ingest
Addressed by https://github.com/SamStudio8/majora2/commit/0507c0d0a75d4bea6cae224ecb99227c3d69b50a, tested on Magenta, merging to prod Majora
Applying majora2.0150_majoraartifactprocessrecord_unique_name... OK
I've used Foel to replay the single non-Sanger affected library and the library is now correctly linked through to a sequencing run. I've advised DJ on next steps and negotiated that we divide up the missing libraries over today and tomorrow just to avoid a cripplingly large Elan run on the weekend.
DJ asked whether it would be OK to reupload the affected runs entirely (complete with the missing and not-missing libraries). I've realised this will have the side effect of creating links again for anyone uploading old data, so I'll put together a migration to backfill the unique_name for DNASequencingProcessRecord.
Needed to extend the unique_name field to fit the oddly long private run names with a library. I should have just used hashlib to make a hex string but whatever, it might be useful to have these human readable in future.
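For reference, the hashlib alternative mentioned above might look something like the sketch below. The helper name hashed_unique_name and the choice of SHA-256 are assumptions for illustration only; Majora itself stores the readable "run-library" string. The point is that a hex digest has a fixed length regardless of how long the run name is, so the field would not need extending.

```python
import hashlib


def hashed_unique_name(run_name: str, library_name: str) -> str:
    """Hypothetical helper: fixed-length key derived from run and library names."""
    # Same "run-library" composition as the readable name, then hashed.
    raw = "%s-%s" % (run_name, library_name)
    # A SHA-256 hex digest is always 64 characters, whatever the input length.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

The trade-off is the one already noted: the digest is opaque, so a readable name is easier to inspect by eye.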
def assign_uniquename(apps, schema_editor):
    DNASequencingProcessRecord = apps.get_model("majora2", "DNASequencingProcessRecord")
    DNASequencingProcess = apps.get_model("majora2", "DNASequencingProcess")
    for record in DNASequencingProcessRecord.objects.all():
        run = DNASequencingProcess.objects.get(id=record.process.id)
        run_name = run.run_name
        library_name = record.in_artifact.dice_name
        record.unique_name = "%s-%s" % (run_name, library_name)
        record.save()
Applying majora2.0152_backfill_dnasequencingprocessrecord_uniquename... OK
DJ has replayed one run and we are both satisfied with our inspections. We'll replay the oldest affected runs and DJ tells me an automated housekeeping process will pick up some of the most recent ones. We can fill in the gap in the middle tomorrow.
Down to 86 libraries now
>>> libraries_on_run = models.DNASequencingProcessRecord.objects.filter(in_artifact__isnull=False).values("in_artifact__id")
>>> models.LibraryArtifact.objects.filter(~Q(id__in=libraries_on_run)).count()
86
Elan processed another 15k or so this morning. DJ has replayed the remaining libraries to Majora for linking. There are 0 libraries now outstanding. Data integrity should be restored by a successful run of Elan tomorrow.
Finished processing the backlog today -- integrity restored
Sanger have alerted us to several runs uploaded to Majora that do not link multiple libraries to the same run, causing the get_sequencing step (which expands runs to libraries, and libraries to biosamples) to leave out a number of artifacts. DJ suggested this recent Majora change is the likely culprit and I believe that is the case too. This was my poorly thought out change to fix a race condition with private provider data. Moving the DNASequencingProcessRecord to only be created when the process is new has inadvertently caused a 1:1 map between libraries and runs. I am scoping a more appropriate change that rolls back this patch but also prevents a return of the Foel race condition.
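To illustrate the failure mode described above, here is a minimal in-memory sketch. It is not Majora's code: the dicts stand in for database tables, and link_buggy, link_fixed and count_links are hypothetical names. The "fixed" variant assumes the record is keyed on its own run-plus-library name, which is inferred from the unique_name migrations mentioned in this thread rather than taken from the actual commit.

```python
def get_or_create_process(store, run_name):
    """Return (process, created), loosely mimicking Django's get_or_create."""
    if run_name in store:
        return store[run_name], False
    store[run_name] = {"run_name": run_name}
    return store[run_name], True


def link_buggy(processes, records, run_name, library_name):
    """Record is only created when the run is new: one library per run."""
    _, created = get_or_create_process(processes, run_name)
    if created:
        records["%s-%s" % (run_name, library_name)] = (run_name, library_name)


def link_fixed(processes, records, run_name, library_name):
    """Record is keyed on its own name, so repeat uploads neither drop nor duplicate."""
    get_or_create_process(processes, run_name)
    records.setdefault("%s-%s" % (run_name, library_name), (run_name, library_name))


def count_links(link, uploads):
    """Replay (run, library) uploads through a linking function and count records."""
    processes, records = {}, {}
    for run_name, library_name in uploads:
        link(processes, records, run_name, library_name)
    return len(records)


# Two libraries on one run, with the first library uploaded twice.
uploads = [("RUN1", "LIB-A"), ("RUN1", "LIB-B"), ("RUN1", "LIB-A")]
```

With these uploads, count_links(link_buggy, uploads) yields a single link (LIB-B is silently dropped), while count_links(link_fixed, uploads) yields two, and the replayed LIB-A does not create a duplicate.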