Shard the published artifact directories

SamStudio8 commented 2 years ago

Tracking internally as EO#15

SamStudio8 commented 2 years ago

Posted to #metadata and the usual places:

Access to individual FASTA and individual BAM files

Hi all... is this thing on... :microphone: My first metadata announcement in a long while now, and it's a big one: next Monday, 2022-01-31, the "individual FASTA" (bham/artifacts/published/fasta) and "individual BAM" (bham/artifacts/published/alignment) directories will be permanently deleted.

Deleting these directories is necessary to solve a critical scaling problem with the file system and will change how users access FASTA and BAM resources.

What does this mean for users?

FASTA users: If you access bham/artifacts/published/fasta to read FASTA files, our advice has long been that you should extract the sequences from the daily consensus FASTA (elan.consensus.fasta), but we realise that some users may not know that this FASTA file is indexed for efficient random access. For users who are not sure how to efficiently pull out sequences from the daily FASTA (hint: it's not grep), @Sam Wilkinson (CLIMB) has kindly written a script that can pull out sequences from the daily consensus using a list (or file) of central_sample_id, run_name, pag_name or (central_sample_id, run_name) pairs, so it should cover almost every use case you may have for extracting subsets of sequences. You can find it in our new utilities repo: https://github.com/CLIMB-COVID/utilities
BAM users: You will no longer be able to assume bham/artifacts/published/alignment is the location for BAM files and will need to "look up" the location of the BAMs using the lookup table provided at /cephfs/covid/artifacts/elan/latest/majora.pag_lookup.tsv. This lookup file will be automatically updated by Elan every day.
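For anyone wondering what "indexed for efficient random access" means for the FASTA advice above, here is a minimal sketch of how a samtools-style .fai index lets you seek straight to one record instead of scanning the whole file. This is not the script from the utilities repo. It assumes the daily FASTA has a standard .fai index alongside it, and the record names used in the demo (sampleA, sampleB) are made up.

```python
import os
import tempfile


def load_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as fh:
        for line in fh:
            name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fetch_sequence(fasta_path, index, name):
    """Seek to one record and return its sequence without reading the rest of the file."""
    length, offset, linebases, linewidth = index[name]
    # Bytes spanned on disk: full lines include their newline bytes,
    # the final partial line is read without its newline.
    full_lines, remainder = divmod(length, linebases)
    n_bytes = full_lines * linewidth + remainder
    with open(fasta_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(n_bytes)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


if __name__ == "__main__":
    # Tiny made-up FASTA and hand-written index, purely for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "demo.fasta")
        with open(fasta, "wb") as fh:
            fh.write(b">sampleA\nACGTACGTAC\n>sampleB\nGGGGCCCC\nTT\n")
        with open(fasta + ".fai", "w") as fh:
            fh.write("sampleA\t10\t9\t10\t11\nsampleB\t10\t29\t8\t9\n")
        print(fetch_sequence(fasta, load_fai(fasta + ".fai"), "sampleB"))
```

The cost of a lookup depends on the size of the index and the one record you read, not on the size of the FASTA, which is why grep is the wrong tool here.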

SamStudio8 commented 2 years ago

Some mammoth work going on behind the scenes to get this done, including:

Ensure cog-publish can reconcile without assumed publish directory https://github.com/SamStudio8/elan-nextflow/commit/f1353e1abf89621d3e19fdd42db371d68db91bb7
Elan 2022-01-26 to move FASTA to new location https://github.com/CLIMB-COVID/elan-nextflow/commit/c1cb945a7f016cbe4e5449d0083d59258c52601a
Majora emits suppression status for pagfiles https://github.com/SamStudio8/majora/commit/8691a2d49af6e9e0d8ea4a84e2b9f9841c0b73be
Ocarina understands suppression status https://github.com/SamStudio8/ocarina/commit/c57af2f378a87656c35c53f45a65dda1ffcb81e4#diff-32a6c866b8261c05a2e747287c275529fb77168a202d077e53a665efb86f35be
Elan writes nice big lookup table in cog-publish https://github.com/SamStudio8/elan-nextflow/commit/67d0bb4cb226c9283f670da32de24439ff38b800

SamStudio8 commented 2 years ago

FASTA files in /fasta/ with a PAG:

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
2205865

Sharded FASTA:

$ shard.py --lookup majora.pag_lookup.tsv --itype consensus --ipath-startswith /cephfs/covid/bham/nicholsz/ --opath /cephfs/covid/artifacts/fasta/ > 20220216.fasta_move.manifest.tsv 2> 20220216.fasta_move.manifest.err
$ wc -l 20220216.fasta_move.manifest.tsv
2205864 20220216.fasta_move.manifest.tsv

An off-by-one so you know it's a legit bioinformatics solution.
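For readers unfamiliar with sharding, a rough sketch of the idea follows. The thread does not show how shard.py picks a destination, so the hash-prefix scheme here is an assumption, not the real implementation. The function names (shard_path, build_manifest, unaccounted) are hypothetical, the parameters only echo the --itype, --ipath-startswith and --opath flags above, and the (kind, path) row shape is an assumed simplification of the lookup table.

```python
import hashlib
import posixpath


def shard_path(src_path, out_root, width=2):
    """Place a file under out_root/<hash prefix>/ so no single directory grows unbounded.

    ASSUMPTION: hashing the file name with md5 and using the first `width`
    hex characters as the subdirectory. The real shard.py may do otherwise.
    """
    name = posixpath.basename(src_path)
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return posixpath.join(out_root, digest[:width], name)


def build_manifest(lookup_rows, itype, ipath_startswith, out_root):
    """Return (source, destination) pairs for rows matching the kind and path prefix."""
    manifest = []
    for kind, path in lookup_rows:
        if kind == itype and path.startswith(ipath_startswith):
            manifest.append((path, shard_path(path, out_root)))
    return manifest


def unaccounted(expected_paths, manifest):
    """Paths expected to move that never made it into the manifest."""
    return set(expected_paths) - {src for src, _ in manifest}
```

With width=2 this gives up to 256 subdirectories. Comparing the set of expected paths against the manifest, as unaccounted does, is one way to chase down a count mismatch like the one above.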

SamStudio8 commented 2 years ago

FASTA files sharded and Majora updated. An errant PAG was created as a side effect* of #177 and has been removed.

>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
0

SamStudio8 commented 2 years ago

BAM files and indexes have been sharded successfully and Majora fully updated. The PAG lookup integrity is degraded today: I decided not to delay the publish step to get it correct, as we were already running significantly late (#189).

I have now checked for orphaned FASTA and BAM in Majora and found 130 unpagged BAM+FASTA pairs. 78 of them are directly linked to artifacts attached to PAGs that were destroyed in response to https://github.com/COG-UK/dipi-group/issues/91 (and can safely be removed). The remaining 52 were injected into Majora for the initial migration back in April 2020 and the FASTA files are demonstrably empty -- so can safely be removed from Majora. There are zero discrepancies to investigate.

SamStudio8 commented 2 years ago

PAG lookup contains 0 references to .../bham/nicholsz/artifacts
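A check like the one above can be sketched in a few lines. This is an illustration rather than the command actually run. It scans every field of a tab-separated lookup for a path fragment, so it does not assume any particular column layout for majora.pag_lookup.tsv. The function name count_references and the demo rows are made up.

```python
import csv
import io


def count_references(tsv_handle, fragment):
    """Count rows of a tab-separated file where any field contains `fragment`."""
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    hits = 0
    for row in reader:
        if any(fragment in field for field in row):
            hits += 1
    return hits


if __name__ == "__main__":
    # Made-up rows standing in for the real lookup table.
    demo = io.StringIO(
        "pag1\tconsensus\t/new/location/ab/x.fasta\n"
        "pag2\talignment\t/new/location/cd/y.bam\n"
    )
    print(count_references(demo, "/old/location/"))
```

Against the real file you would pass an open handle to the lookup and the old directory fragment. A result of 0 means nothing still points at the old location.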

COG-UK / dipi-group

Shard the published artifact directories #164