COG-UK / dipi-group

Data integrity and pipeline integration working group

Shard the published artifact directories #164

Closed SamStudio8 closed 2 years ago

SamStudio8 commented 2 years ago

Tracking internally as EO#15

SamStudio8 commented 2 years ago

Posted to #metadata and the usual places:


Access to individual FASTA and individual BAM files

Hi all... is this thing on... :microphone: My first metadata announcement in a long while now, and it's a big one: Next Monday on 2022-01-31 the "individual FASTA" (bham/artifacts/published/fasta) and "individual BAM" (bham/artifacts/published/alignment) directories will be permanently deleted.

Deleting these directories is necessary to solve a critical scaling problem with the file system and will change how users access FASTA and BAM resources.

What does this mean for users?

SamStudio8 commented 2 years ago

Some mammoth work going on behind the scenes to get this done, including:

SamStudio8 commented 2 years ago

FASTA files in /fasta/ with a PAG:

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
2205865

Sharded FASTA:

$ shard.py --lookup majora.pag_lookup.tsv --itype consensus --ipath-startswith /cephfs/covid/bham/nicholsz/ --opath /cephfs/covid/artifacts/fasta/ > 20220216.fasta_move.manifest.tsv 2> 20220216.fasta_move.manifest.err
$ wc -l 20220216.fasta_move.manifest.tsv
2205864 20220216.fasta_move.manifest.tsv

An off-by-one so you know it's a legit bioinformatics solution.
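A one-record gap like this is quick to track down with a set difference. The sketch below is not part of shard.py; it assumes the source paths Majora knows about and the source paths that made it into the manifest have each been exported as a plain list. The function name and the example paths are hypothetical.

```python
def missing_from_manifest(expected_paths, manifest_paths):
    """Return paths that were expected but never written to the manifest."""
    # Sets make the comparison order-independent and ignore duplicates.
    return sorted(set(expected_paths) - set(manifest_paths))


# Hypothetical example: three artifacts known to the database, two moved.
expected = ["/src/a.fasta", "/src/b.fasta", "/src/c.fasta"]
moved = ["/src/a.fasta", "/src/c.fasta"]
print(missing_from_manifest(expected, moved))  # ['/src/b.fasta']
```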

SamStudio8 commented 2 years ago

FASTA files sharded and Majora updated. An errant PAG was created as a side effect* of #177 and has been removed.

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
0

SamStudio8 commented 2 years ago

BAM files and indexes have been sharded successfully and Majora has been fully updated. The PAG lookup's integrity is degraded today: we were already running significantly late (#189), so I decided not to delay the publish step to get it correct.
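For readers wondering what sharding looks like in practice, here is a minimal sketch. It is not the shard.py used above, whose layout is not shown in this thread. It assumes a hash-prefix scheme keyed on the part of the filename before the first dot, so that a BAM and its index land in the same subdirectory. The function name, the two-character shard width and the example filenames are all assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(root, filename, width=2):
    """Place filename under root/<shard>/, where shard is a hash prefix."""
    # Key on the name up to the first dot so that sample.bam and
    # sample.bam.bai hash to the same shard (an assumed convention).
    key = filename.split(".", 1)[0]
    shard = hashlib.sha256(key.encode("utf-8")).hexdigest()[:width]
    return str(PurePosixPath(root) / shard / filename)


bam = shard_path("/artifacts/alignment", "sample1.bam")
bai = shard_path("/artifacts/alignment", "sample1.bam.bai")
# Both files share a parent directory because they share a key.
assert PurePosixPath(bam).parent == PurePosixPath(bai).parent
```

With a two-character hex prefix there are 256 shards, so roughly 2.2 million files would average about 8,600 per directory, assuming the hash spreads them evenly.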

I have now checked for orphaned FASTA and BAM in Majora and found 130 unpagged BAM+FASTA pairs. Of these, 78 are directly linked to artifacts attached to PAGs that were destroyed in response to https://github.com/COG-UK/dipi-group/issues/91 and can safely be removed. The remaining artifacts were injected into Majora for the initial migration back in April 2020, and their FASTA files are demonstrably empty, so they can also safely be removed from Majora. There are zero discrepancies to investigate.
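The triage described above amounts to sorting each unpagged pair into one of three buckets. A minimal sketch in plain Python, outside Majora's ORM: the record fields (`destroyed_pag`, `fasta_size`) and the bucket names are hypothetical stand-ins for whatever the real check inspected, and "empty" is assumed here to mean zero bytes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Orphan:
    name: str
    destroyed_pag: bool  # hypothetical: linked to a PAG that was destroyed
    fasta_size: int      # hypothetical: size of the FASTA on disk, in bytes


def triage(orphan):
    """Return the bucket an unpagged BAM+FASTA pair falls into."""
    if orphan.destroyed_pag:
        return "remove: PAG destroyed"
    if orphan.fasta_size == 0:
        return "remove: empty FASTA"
    # Anything left over is a genuine discrepancy that needs a human.
    return "investigate"


orphans = [
    Orphan("pair-1", destroyed_pag=True, fasta_size=29903),
    Orphan("pair-2", destroyed_pag=False, fasta_size=0),
]
print(Counter(triage(o) for o in orphans))
```

An empty "investigate" bucket corresponds to the "zero discrepancies" result reported above.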

SamStudio8 commented 2 years ago

PAG lookup contains 0 references to .../bham/nicholsz/artifacts
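A check like this is easy to script. The sketch below counts the lines of a lookup file that mention a given path fragment; the function name and the example rows are hypothetical, and the fragment is passed in as an argument because the full path is abbreviated above.

```python
def count_references(lines, fragment):
    """Count lines that contain the given path fragment."""
    return sum(1 for line in lines if fragment in line)


# Hypothetical lookup rows; a real run would iterate over the open TSV file.
rows = [
    "sample1\t/new/fasta/ab/sample1.fasta",
    "sample2\t/old/artifacts/sample2.fasta",
]
print(count_references(rows, "/old/artifacts"))  # 1
```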