COG-UK / dipi-group

Data integrity and pipeline integration working group

Genomes from Elan 20220105 have been unpublished #179

Closed SamStudio8 closed 2 years ago

SamStudio8 commented 2 years ago

During the handling of yesterday's data integrity incident (#178), the data set for 2022-01-05 was republished. It appears that during this process the new data for 2022-01-05 was not added back to the data set, so we essentially republished the 2022-01-04 data set. These genomes are now missing from the 2022-01-06 data set and will need to be reinserted.

SamStudio8 commented 2 years ago

Yesterday I should have noticed that there was no new work to do when republishing the data set; this should have been grounds to stop what we were doing, as we were writing OVER the data set (and so there should have been plenty of work to do). The key mistake was that I ran the first step of cog-publish BEFORE repointing the published/head symlink.

The name coded into the head dir name is used to populate $LAST_DATE (https://github.com/SamStudio8/elan-nextflow/blob/master/bin/control/cog-publish.sh#L11). By not repointing this, Majora had no new genomes to publish. Although I correctly repointed the dir to run the reconcile step shortly afterward, I only double-checked that the bad data had been removed (not realising all the rest had been removed too!).
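
A guard against this kind of slip could look something like the minimal Python sketch below. It is not how cog-publish.sh works; it only assumes, for illustration, that head is a symlink to a directory named YYYYMMDD. The helper names `last_date_from_head` and `check_head` are hypothetical.

```python
import os
import re
from datetime import date, datetime
from pathlib import Path


def last_date_from_head(head_link: Path) -> date:
    """Parse a date from the name of the directory the head symlink points at.

    Assumption: the target directory is named YYYYMMDD (e.g. 20220104).
    """
    target_name = Path(os.path.realpath(head_link)).name
    if re.fullmatch(r"\d{8}", target_name) is None:
        raise ValueError(f"head points at {target_name!r}, which is not YYYYMMDD")
    return datetime.strptime(target_name, "%Y%m%d").date()


def check_head(head_link: Path, expected_last_date: date) -> date:
    """Refuse to continue unless head encodes the date the operator expects."""
    found = last_date_from_head(head_link)
    if found != expected_last_date:
        raise RuntimeError(
            f"head encodes {found}, expected {expected_last_date}; repoint head first"
        )
    return found
```

When republishing the 2022-01-05 data set, the expected last date would be 2022-01-04; with head still pointing at the 2022-01-05 directory, the check raises before any publishing step runs.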

SamStudio8 commented 2 years ago

I think we have two options:

I'm tending towards the latter, if only because it would be more technically correct for the lost genomes to have their published date changed to the date they were inserted into the data set properly, and it's an easier solution to reason about.

BioWilko commented 2 years ago

The latter also strikes me as less likely to lead to some unforeseen consequence.

SamStudio8 commented 2 years ago

So:

SamStudio8 commented 2 years ago

We're going to go with the safer (and technically more correct) Option 2. It should be straightforward (famous last words) as we can query Majora for published artifact groups with a published_date of 2022-01-05 and change them to 2022-01-07 (allowing them to be picked up as brand new tomorrow). I want to make sure we don't miss anything so @BioWilko is comparing the 2022-01-04 and 2022-01-05 data sets to ensure nothing will slip through the cracks. Once we've got an exact count of the number of affected PAGs we can make the update to Majora. We'll be able to check this has worked with Ocarina too.
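
The data set comparison boils down to a set difference over identifiers. The sketch below is only an illustration of that idea: it assumes each data set can be reduced to one identifier per line, and the function names and sample identifiers are made up rather than taken from the actual check.

```python
from typing import Iterable, Set, Tuple


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty, whitespace-stripped identifiers, one per line."""
    return {line.strip() for line in lines if line.strip()}


def compare_id_sets(older: Set[str], newer: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (identifiers only in older, identifiers only in newer)."""
    return older - newer, newer - older


# Hypothetical identifiers, for illustration only.
older = read_ids(["SAMPLE-A\n", "SAMPLE-B\n", "\n"])
newer = read_ids(["SAMPLE-A\n", "SAMPLE-B\n", "SAMPLE-C\n"])
dropped, added = compare_id_sets(older, newer)
print(f"{len(dropped)} dropped, {len(added)} added")
```

Counting `added` gives a figure that can be cross-checked against the number of PAGs Majora reports for the affected date.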

I'm just chasing up a loose end at PHE, as they are reporting that the Asklepian genome table was 2 genomes smaller, a discrepancy that doesn't fit expectations given what's happened.

SamStudio8 commented 2 years ago

The number of times I have typed 2021 in here is embarrassing

BioWilko commented 2 years ago

Wow I actually didn't notice which is possibly worse

SamStudio8 commented 2 years ago

I have consulted with the Majora oracle:

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05").count()
14383
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc").count()
14157
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986
>>>

The number of affected sequences is officially 13,986.
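
For reference, the three counts above can be broken down as follows. This is plain arithmetic on the reported numbers and assumes each extra filter only narrows the previous result; the variable names are illustrative.

```python
# Counts reported by the three queries for published_date 2022-01-05.
total = 14383
passed_qc = 14157
passed_and_unsuppressed = 13986

# PAGs dropped by each additional filter.
not_matched_by_pass_filter = total - passed_qc
suppressed_among_passed = passed_qc - passed_and_unsuppressed

print(not_matched_by_pass_filter, suppressed_among_passed)
```

This gives 226 PAGs not matched by the QC pass filter and a further 171 that pass but are suppressed, leaving the 13,986 that need a new published date.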

SamStudio8 commented 2 years ago

I've also cleared up the -2 situation at PHE. I think we're ready to go and update the published dates for the affected sequences.

SamStudio8 commented 2 years ago

OK that's done.

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-07", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986

Note that we've left the rejected and suppressed genomes with the original published date because they are unaffected by this problem.
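
The selection rule can be sketched in plain Python. This is not the command that was run against Majora: `Pag` is a hypothetical stand-in for a published artifact group, and `redate_affected` is a made-up helper that mirrors the filter used in the queries above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Pag:
    """Hypothetical stand-in for a published artifact group record."""
    name: str
    published_date: date
    is_pass: bool
    is_suppressed: bool


def redate_affected(pags: List[Pag], old: date, new: date) -> int:
    """Move passing, unsuppressed PAGs from old to new; return how many changed."""
    changed = 0
    for pag in pags:
        if pag.published_date == old and pag.is_pass and not pag.is_suppressed:
            pag.published_date = new
            changed += 1
    return changed
```

Rejected and suppressed records fall outside the condition and so keep their original published date.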

SamStudio8 commented 2 years ago

Looking good

[20220107]$ tail -f publish.log 
20220107
[CPUB] LAST_DATE=2022-01-06
23466 pass.fasta.ls
23466 pass.bam.ls
141 kill.fasta.ls
141 kill.bam.ls

BioWilko commented 2 years ago

I have checked the metadata TSV and big MSA and the missing samples appear to have been included in the latest dataset.