Closed SamStudio8 closed 2 years ago
Yesterday I should have noticed that there was no new work to do when republishing the data set; this should have been grounds to stop what we were doing, as we were writing OVER the data set (and so there should have been plenty of work to do). The key mistake was I ran the first step of cog-publish BEFORE repointing the published/head symlink.
The date encoded in the head dir name is used to populate $LAST_DATE (https://github.com/SamStudio8/elan-nextflow/blob/master/bin/control/cog-publish.sh#L11). Because I had not repointed it, Majora had no new genomes to publish. Although I correctly repointed the dir to run the reconcile step shortly afterward, I only double checked that the bad data had been removed (not realising all the rest had been removed too!).
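To make the failure mode concrete, here is a minimal Python sketch of the idea. It is not the actual cog-publish.sh logic: it assumes head is a symlink to a directory named YYYYMMDD, and the function names `last_date_from_head` and `require_head` are hypothetical.

```python
import os
import re
from datetime import datetime


def last_date_from_head(head_path):
    """Return the date encoded in the directory that `head` resolves to.

    Assumption: the target directory is named YYYYMMDD.
    """
    target = os.path.basename(os.path.realpath(head_path))
    if not re.fullmatch(r"\d{8}", target):
        raise ValueError(f"head points at unexpected directory name: {target!r}")
    return datetime.strptime(target, "%Y%m%d").strftime("%Y-%m-%d")


def require_head(head_path, expected_dir):
    """Refuse to continue unless head resolves to the expected directory."""
    actual = os.path.basename(os.path.realpath(head_path))
    if actual != expected_dir:
        raise RuntimeError(
            f"head points at {actual!r}, expected {expected_dir!r}; "
            "repoint it before running the first publish step"
        )
    return last_date_from_head(head_path)
```

A check like `require_head("published/head", "20220104")` before the first step would have stopped the republish while head still pointed at 20220105.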
I think we have two options:
1. Repoint head at 20220104, which could potentially republish the missing data (as well as today's) by extending the window in which reconcile is allowed to hit the file system
2. Update PublishedArtifactGroup.published_date for the affected data in Majora, meaning they will look new for 2022-01-07

I'm tending towards the latter, if only because it would be more technically correct for the lost genomes to have their published date changed to the date they were inserted into the data set properly, and it's an easier solution to reason about.
The latter also strikes me as less likely to lead to some unforeseen consequence.
So:
We're going to go with the safer (and technically more correct) Option 2. It should be straightforward (famous last words) as we can query Majora for published artifact groups with a published_date of 2022-01-05 and change them to 2022-01-07 (allowing them to be picked up as brand new tomorrow). I want to make sure we don't miss anything so @BioWilko is comparing the 2022-01-04 and 2022-01-05 data sets to ensure nothing will slip through the cracks. Once we've got an exact count of the number of affected PAGs we can make the update to Majora. We'll be able to check this has worked with Ocarina too.
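The comparison of the two data sets boils down to a set difference over sample identifiers. A rough sketch of that check follows, assuming each data set can be reduced to a list of IDs; the function name `compare_ids` and the example IDs are made up for illustration and are not taken from the real data.

```python
def compare_ids(old_ids, new_ids):
    """Return (only_in_old, only_in_new) as sorted lists of identifiers."""
    old, new = set(old_ids), set(new_ids)
    return sorted(old - new), sorted(new - old)


# Hypothetical identifiers, for illustration only.
only_0104, only_0105 = compare_ids(
    ["SAMPLE-A", "SAMPLE-B", "SAMPLE-C"],
    ["SAMPLE-B", "SAMPLE-C", "SAMPLE-D"],
)
print(only_0104)  # IDs that were dropped
print(only_0105)  # IDs that were added
```

If the republished 2022-01-05 data set is essentially the 2022-01-04 one, both lists should be short and explainable (for example, only the deliberately removed bad data).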
I'm just chasing up a loose end at PHE as they are reporting that the Asklepian genome table was 2 genomes smaller, a discrepancy that doesn't fit expectations given what's happened.
The number of times I have typed 2021 in here is embarrassing
Wow I actually didn't notice which is possibly worse
I have consulted with the Majora oracle:
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05").count()
14383
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc").count()
14157
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986
>>>
The number of affected sequences is officially 13,986.
I've also cleared up the -2 situation at PHE. I think we're ready to go and update the published dates for the affected sequences.
OK that's done.
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-07", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986
Note that we've left the rejected and suppressed genomes with the original published date because they are unaffected by this problem.
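The update itself isn't shown in this thread. For the record, a minimal sketch of how such a change could be guarded is below. It reuses the filter from the counts above and relies only on the standard Django queryset methods filter(), count() and update(); the function name `redate_published` is hypothetical and this is not necessarily the command that was actually run.

```python
def redate_published(manager, old_date, new_date, expected_count):
    """Move passing, unsuppressed PAGs from old_date to new_date.

    `manager` is assumed to behave like
    models.PublishedArtifactGroup.objects (a Django model manager).
    """
    affected = manager.filter(
        published_date=old_date,
        quality_groups__is_pass=True,
        quality_groups__test_group__slug="cog-uk-elan-minimal-qc",
        is_suppressed=False,
    )
    found = affected.count()
    # Stop before touching anything if the count is not the agreed number.
    if found != expected_count:
        raise RuntimeError(f"expected {expected_count} PAGs, found {found}")
    # update() returns the number of rows it matched.
    return affected.update(published_date=new_date)
```

From the Majora shell this would be called as `redate_published(models.PublishedArtifactGroup.objects, "2022-01-05", "2022-01-07", 13986)`. Rejected and suppressed genomes fall outside the filter and so keep their original date.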
Looking good
[20220107]$ tail -f publish.log
20220107
[CPUB] LAST_DATE=2022-01-06
23466 pass.fasta.ls
23466 pass.bam.ls
141 kill.fasta.ls
141 kill.bam.ls
I have checked the metadata TSV and big MSA and the missing samples appear to have been included in the latest dataset.
During the handling of yesterday's data integrity incident (#178), the data set for 2022-01-05 was republished. It appears that during this process the new data for 2022-01-05 was not added back to the data set, so we essentially republished the 2022-01-04 data set. These genomes are now missing from the 2022-01-06 data set and will need to be reinserted.