COG-UK / dipi-group

Data integrity and pipeline integration working group

Shard the published artifact directories #164

Closed SamStudio8 closed 2 years ago

SamStudio8 commented 2 years ago

Tracking internally as EO#15

SamStudio8 commented 2 years ago

Posted to #metadata and the usual places:

Access to individual FASTA and individual BAM files

Hi all... is this thing on... :microphone: My first metadata announcement in a long while now, and it's a big one: next Monday (2022-01-31) the "individual FASTA" (bham/artifacts/published/fasta) and "individual BAM" (bham/artifacts/published/alignment) directories will be permanently deleted.

Deleting these directories is necessary to solve a critical scaling problem with the file system and will change how users access FASTA and BAM resources. What does this mean for users?

SamStudio8 commented 2 years ago

Some mammoth work going on behind the scenes to get this done, including:

SamStudio8 commented 2 years ago

FASTA files in /fasta/ with a PAG:

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()

Sharded FASTA:

$ --lookup majora.pag_lookup.tsv --itype consensus --ipath-startswith /cephfs/covid/bham/nicholsz/ --opath /cephfs/covid/artifacts/fasta/ > 20220216.fasta_move.manifest.tsv 2> 20220216.fasta_move.manifest.err
$ wc -l 20220216.fasta_move.manifest.tsv
2205864 20220216.fasta_move.manifest.tsv

An off-by-one so you know it's a legit bioinformatics solution.
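
For readers unfamiliar with the idea, here is a minimal sketch of how a move manifest for a sharded layout could be built. The thread does not state the sharding scheme actually used, so the hash-prefix scheme, the function names `shard_path` and `build_manifest`, and the example paths are all assumptions for illustration, not the tool shown above.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(src: str, out_root: str, width: int = 2) -> str:
    """Map a file to out_root/<shard>/<filename>.

    ASSUMPTION: the shard is the first `width` hex characters of the MD5 of
    the filename. The scheme really used for the artifacts is not stated.
    """
    name = PurePosixPath(src).name
    shard = hashlib.md5(name.encode("utf-8")).hexdigest()[:width]
    return str(PurePosixPath(out_root) / shard / name)


def build_manifest(paths, in_prefix: str, out_root: str):
    """Yield (source, destination) pairs for every path under in_prefix."""
    for src in paths:
        if src.startswith(in_prefix):
            yield src, shard_path(src, out_root)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = ["/old/flat/sample1.fasta", "/elsewhere/sample2.fasta"]
    for src, dst in build_manifest(example, "/old/flat/", "/new/sharded/"):
        print(f"{src}\t{dst}")
```

Comparing the number of manifest rows against the count returned by the database query is a cheap sanity check before any file is moved.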

SamStudio8 commented 2 years ago

FASTA files sharded and Majora updated. An errant PAG was created as a side effect of #177 and has been removed.

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()

SamStudio8 commented 2 years ago

BAM files and indexes have been sharded successfully and Majora has been fully updated. The PAG lookup integrity is degraded today: I decided not to delay the publish step to correct it, as we were already running significantly late (#189).

I have now checked for orphaned FASTA and BAM in Majora and found 130 unpagged BAM+FASTA pairs. 78 of them are directly linked to artifacts attached to PAGs that were destroyed in response to (and can safely be removed). The remaining 52 were injected into Majora for the initial migration back in April 2020 and their FASTA files are demonstrably empty, so they can also safely be removed from Majora. There are zero discrepancies to investigate.
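
The orphan check described above amounts to a set difference followed by bucketing each orphan by its explanation. A rough sketch, assuming artifacts can be represented by plain identifiers; `classify_orphans` and its argument names are hypothetical and do not come from Majora:

```python
def classify_orphans(all_artifacts, pagged, linked_to_destroyed_pag, empty_files):
    """Bucket artifacts that have no PAG by the reason they are orphaned.

    All four arguments are iterables of artifact identifiers (hypothetical
    representation; the real check ran against Majora's database).
    """
    orphans = set(all_artifacts) - set(pagged)
    destroyed = orphans & set(linked_to_destroyed_pag)
    empty = (orphans - destroyed) & set(empty_files)
    unexplained = orphans - destroyed - empty
    return {
        "destroyed_pag": sorted(destroyed),
        "empty": sorted(empty),
        "unexplained": sorted(unexplained),  # anything here needs investigating
    }


if __name__ == "__main__":
    report = classify_orphans(
        all_artifacts=["a", "b", "c", "d"],
        pagged=["a"],
        linked_to_destroyed_pag=["b"],
        empty_files=["c"],
    )
    print(report)
```

An empty `unexplained` bucket corresponds to the "zero discrepancies" outcome reported here.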

SamStudio8 commented 2 years ago

PAG lookup contains 0 references to .../bham/nicholsz/artifacts
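
A check like this can be reproduced by scanning the lookup file for the old path prefix. The sketch below assumes the lookup is a tab-separated file with paths somewhere in its columns; the column layout and the helper name `count_prefix_references` are assumptions, not the actual tooling.

```python
import csv


def count_prefix_references(tsv_lines, prefix: str) -> int:
    """Count rows in which any field contains `prefix`.

    `tsv_lines` is any iterable of tab-separated lines, such as an open file.
    """
    reader = csv.reader(tsv_lines, delimiter="\t")
    return sum(1 for row in reader if any(prefix in field for field in row))


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = [
        "PAG-1\t/some/root/bham/nicholsz/artifacts/x.fasta",
        "PAG-2\t/new/sharded/ab/y.fasta",
    ]
    print(count_prefix_references(rows, "/bham/nicholsz/artifacts"))
```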