Closed SamStudio8 closed 2 years ago
Posted to #metadata and the usual places:
Access to individual FASTA and individual BAM files
Hi all... is this thing on... :microphone: My first metadata announcement in a long while now, and it's a big one:
Next Monday on 2022-01-31 the "individual FASTA" (bham/artifacts/published/fasta) and "individual BAM" (bham/artifacts/published/alignment) directories will be permanently deleted.
Deleting these directories is necessary to solve a critical scaling problem with the file system and will change how users access FASTA and BAM resources. What does this mean for users?
bham/artifacts/published/fasta: to read FASTA files, our advice has long been that you should extract the sequences from the daily consensus FASTA (elan.consensus.fasta), but we realise that some users may not know that this FASTA file is indexed for efficient random access. For users who are not sure how to efficiently pull out sequences from the daily FASTA (hint: it's not grep), @Sam Wilkinson (CLIMB) has kindly written a script that can pull out sequences from the daily consensus using a list (or file) of central_sample_id, run_name, pag_name or (central_sample_id, run_name), so this should cover almost every possible use case you may have for extracting subsets of sequences. You can find it on our new utilities repo: https://github.com/CLIMB-COVID/utilities

bham/artifacts/published/alignment: this is the location for BAM files. Users will instead need to "look up" the location of the BAMs using the lookup table provided at /cephfs/covid/artifacts/elan/latest/majora.pag_lookup.tsv. This lookup file will be automatically updated by Elan every day.

Some mammoth work going on behind the scenes to get this done, including:
FASTA files in /fasta/ with a PAG:
>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
2205865
Sharded FASTA:
$ shard.py --lookup majora.pag_lookup.tsv --itype consensus --ipath-startswith /cephfs/covid/bham/nicholsz/ --opath /cephfs/covid/artifacts/fasta/ > 20220216.fasta_move.manifest.tsv 2> 20220216.fasta_move.manifest.err
$ wc -l 20220216.fasta_move.manifest.tsv
2205864 20220216.fasta_move.manifest.tsv
An off-by-one so you know it's a legit bioinformatics solution.
FASTA files sharded and Majora updated. An errant PAG was created as a side effect* of #177 and has been removed.
>>> models.DigitalResourceArtifact.objects.filter(current_kind="consensus", primary_group__digitalresourcegroup__current_name="fasta", groups__publishedartifactgroup__id__isnull=False).count()
0
BAM files and indexes have been sharded successfully and Majora fully updated. The PAG lookup integrity is degraded today: I decided not to delay the publish step to get it correct, as we were already running significantly late (#189).
I have now checked for orphaned FASTA and BAM files in Majora, finding 130 unpagged BAM+FAS pairs. 78 of them are directly linked to artifacts attached to PAGs that were destroyed in response to https://github.com/COG-UK/dipi-group/issues/91 (and can safely be removed). The remaining artifacts were injected into Majora for the initial migration back in April 2020 and the FASTA files are demonstrably empty -- so can safely be removed from Majora. There are zero discrepancies to investigate.
PAG lookup contains 0 references to .../bham/nicholsz/artifacts
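A check like the one above could be reproduced with something along these lines. This is a minimal sketch, not the command that was actually run: it assumes only that the lookup is tab-separated, `count_references` is a hypothetical helper (not part of Majora or Elan), and the rows and paths in the demo are invented.

```python
import csv
import io


def count_references(lookup_lines, needle):
    """Count lookup rows where any tab-separated field contains `needle`."""
    reader = csv.reader(lookup_lines, delimiter="\t")
    return sum(1 for row in reader if any(needle in field for field in row))


# Made-up rows standing in for majora.pag_lookup.tsv; the real columns may differ.
demo = io.StringIO(
    "pag-a\tconsensus\t/new/place/a.fasta\n"
    "pag-b\talignment\t/old/place/b.bam\n"
)
old_refs = count_references(demo, "/old/place/")
print(old_refs)
```

On the real file you would pass an open file handle and the old path prefix; a result of 0 means nothing in the lookup still points at the old location.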
Tracking internally as EO#15
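On the announcement's point that the daily consensus FASTA is indexed for random access, here is a rough illustration of why an index beats grep. It assumes the index is a samtools-faidx-style .fai file sitting next to the FASTA (the announcement only says "indexed", so that is a guess), and the sample names and sequences are made up. For real work the script in the utilities repo is the recommended route; `load_fai` and `fetch` are hypothetical helpers written for this sketch.

```python
import os
import tempfile


def load_fai(fai_path):
    """Parse a faidx-style index: name, length, offset, line bases, line width."""
    index = {}
    with open(fai_path) as fh:
        for row in fh:
            name, length, offset, linebases, linewidth = row.rstrip("\n").split("\t")[:5]
            index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fetch(fasta_path, index, name):
    """Seek straight to one record instead of scanning the whole file."""
    length, offset, linebases, linewidth = index[name]
    if length == 0:
        return ""
    # Upper bound on bytes to read: the bases plus one line ending per line.
    n_lines = length // linebases + 1
    n_bytes = length + n_lines * (linewidth - linebases)
    with open(fasta_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(n_bytes)
    # Drop line endings, then trim any over-read into the next record.
    seq = raw.replace(b"\n", b"").replace(b"\r", b"")
    return seq[:length].decode()


# Tiny stand-in for the daily FASTA and its index (names are made up).
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "demo.fasta")
with open(fasta_path, "wb") as fh:
    fh.write(b">sample1\nACGT\nAC\n>sample2\nGGGG\nTT\n")
with open(fasta_path + ".fai", "w") as fh:
    fh.write("sample1\t6\t9\t4\t5\nsample2\t6\t26\t4\t5\n")

index = load_fai(fasta_path + ".fai")
for wanted in ["sample2"]:
    print(">" + wanted)
    print(fetch(fasta_path, index, wanted))
```

Each lookup costs one seek and one small read regardless of how large the FASTA is, whereas grep has to scan the whole file for every query.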