Closed by SamStudio8 3 years ago.
Checking whether `collection_pillar` is strictly necessary. I think it may be acting as a proxy here for something else; ideally that information would come from PHA as it's more likely to be correct.
NG confirms this is the field they want. Will proceed to that spec.
`make_genomes_table.py` is responsible for pulling the metadata from the core metadata file and zipping it to the genomes. `adm1` and `collection_pillar` are trivial; however, `published_date` is not in scope. Ideally we'd achieve this request with a Majora dataview, but they are still too slow at whole-dataset scale (https://github.com/SamStudio8/majora/issues/27).

Given the amount of work required to address the performance constraints of Majora's MDV API endpoint, we'll need to tide this over with something. The easiest solution will be to pull all pairs of `published_name` and `published_date` and mix these into the genome table. Ideally, long term, everything would move to a faster version of the MDV endpoint.
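A rough sketch of that mixing step; the function and column names here are illustrative assumptions, not Majora's actual schema or the real `make_genomes_table.py` code:

```python
# Hypothetical sketch: mix (published_name, published_date) pairs into the
# genome table rows keyed on published_name.
def merge_published_dates(genome_rows, date_pairs):
    """Yield genome rows with a published_date column filled in from date_pairs."""
    dates = dict(date_pairs)  # published_name -> published_date
    for row in genome_rows:
        row["published_date"] = dates.get(row["published_name"], "")
        yield row

genomes = [{"published_name": "SAMPLE-1", "adm1": "UK-ENG"}]
pairs = [("SAMPLE-1", "2021-03-30")]
merged = list(merge_published_dates(genomes, pairs))
print(merged[0]["published_date"])  # 2021-03-30
```

A plain dict lookup keeps this O(n) over the genome table, which matters at whole-dataset scale.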
As it happens, the `get pag` endpoint used to kick off Asklepian should have all the metadata in scope -- this would be a good stepping stone towards my ideal solution: we'd cut out the core metadata table and leverage the API instead. In a parallel universe where I have time to re-architect the MDV API, it would be quite easy to switch `get pag` over to use it.
Suspicion confirmed: the `get pag` API has everything we need.

https://github.com/SamStudio8/asklepian/commit/db3387b10b13ccd5b7a94969d7cbdbc6ef3b616b adds an updated genome table script that will push a `test_v2` copy of the genome table until we are ready to switch over.
Changes deployed ready for tomorrow's Asklepian.
confirmed 20210330@1800

`test_v2` table to be default output and remove v1 script - 20210420
As per discussion with DG, compressing v2 genomics table as of today https://github.com/COG-UK/dipi-group/issues/43
CJ's team has picked this up now. Hopefully we can make the switch soon.
https://github.com/SamStudio8/asklepian/commit/71ca4555f867e71b4957f4b33047358cdb9e9672 deprecates the v1 genome table, and removes the test_ prefix from the v2 table. v1 genome table will not be generated 2021-04-21.
The v2 genomes table will not be automatically deleted (as usual) in case we need to resend or drop the columns to create the v1 table for whatever reason. Once we're happy we can return to deleting it as usual.
20210423
Ingest failed on the other side, engineers investigating. See JIRA ~~EDGE-2004~~, ~~EDGE-2152~~, DA-7013.
Chased this up with the engineers on the other side. Appears the compression was not taken into account? Regardless, issue with ingest appears to be resolved now.
NG confirms the genomes table ingested has the `Sequence` field as MSA (#61), so the ingest must be the latest data, hooray! :tada: :parrot:
Going to chase compression on the variant table up this week to try and close this.
Moving this to backlog #62 as the change process on the other side is moving so slowly.
Discussed this with CG and have agreed to compress the variant table starting with tomorrow's run (20210604). I will add the `gzip` step to the Asklepian `go.sh` after today's (20210603) run has completed and notify CG. CG will update the ingest pipeline on their end to expect a gzipped input (like the genome table) after the 20210603 ingest completes later today. We will monitor the pipeline closely tomorrow to ensure continuity.
Change implemented by https://github.com/SamStudio8/asklepian/commit/064817ed695237ae3dc63c026f345f535ba96038. The output filename will now be suffixed with `.gz`: `variant_table_$DATESTAMP.csv.gz`. CG notified and acknowledged.
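For reference, the compression step amounts to something like this (a sketch only; the real change lives in `go.sh`, and the paths here are illustrative):

```python
import gzip
import shutil

def gzip_table(csv_path):
    """Compress a table at source, producing e.g. variant_table_$DATESTAMP.csv.gz."""
    gz_path = csv_path + ".gz"
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream in chunks rather than reading the whole table into memory,
        # since the variant table is large.
        shutil.copyfileobj(src, dst)
    return gz_path
```

Compressing at source trades a little CPU time on CLIMB for much less I/O during transfer.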
CG confirms partner change has been performed on their side. Green light for tomorrow :rocket:
Compressed variant table written and sent. Variant table step was around 15 minutes faster compared to yesterday, and the compression ratio in the new CSV is around 7x.
Reinflated CSV is where it is supposed to be on CLIMB-COVID, and the downstream `asklepian-db` step has run successfully. Spoken to CG on the other end and the gzipped variant table is processing on the other side! :fire: :rocket:
So apparently gzip files and Apache Spark are not friends (http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=RGkJacJAW3tTOQQA@mail.gmail.com%3E) and this change has caused performance trouble on the other side, as Spark is not able to split up the input for efficiency. That link mentions Snappy-compressed files are splittable, so we can try that.
We're rolling this change back for the weekend and will experiment with Snappy compression (or alternatives) next week.
Reverted by CG on the other side
Installed `python-snappy` as it has a module that binds the snappy library for easy use from the CLI, because obviously snappy is so hipster they can't possibly just distribute a binary. Sent over `test_variant_table_20210604.csv.snappy` just to try it out.
As a quick sanity check, the `cat unsnappy | python -m snappy -c > snappy` to `python -m snappy -d snappy > unsnappy` round trip does give us the same file back.
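That round trip can be expressed as a generic sanity check; a minimal sketch using stdlib codecs (python-snappy's `snappy.compress`/`snappy.decompress` would slot into the same shape, but are not assumed installed here):

```python
import bz2
import gzip

def round_trip_ok(compress, decompress, data):
    """Return True if decompress(compress(data)) reproduces data exactly."""
    return decompress(compress(data)) == data

payload = b"ACGTNACGTN" * 1000  # stand-in for a chunk of genome table
assert round_trip_ok(gzip.compress, gzip.decompress, payload)
assert round_trip_ok(bz2.compress, bz2.decompress, payload)
```

Note this only proves the codec is lossless; as the next comment shows, it says nothing about whether the consumer can actually read (or split) the framing.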
Naturally, that file did not work because of course there are different, poorly documented codecs for snappy. Sent a replacement generated with `-t hadoop_snappy`, which is more likely to work according to some stranger on Stack Overflow.
CG reports the `hadoop_snappy` file was not splittable either. This SO article (https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable) conflicts with yesterday's reading and says that whole files compressed with Snappy won't be splittable after all. Given this was supposed to be a sticking plaster before we could get to implementing the incremental tables, I don't want to spend too long delving into wtf is going on here, and I didn't really like the half-finished look of Snappy anyway.
Interestingly, another SO article (https://stackoverflow.com/a/25888475/2576437) mentions bzip2 and LZ4 (via https://github.com/fingltd/4mc) are supposed to be splittable, and those are totally normal compression algorithms.
CG confirms bzip2 is splittable :tada: Problematically, it also seems to be the slowest compression option we've tried. SN will do a couple of naive `time` tests to see what the impact of swapping to `bzip2` would be. It may be that a small compression-time penalty on the CLIMB side to speed up the PHA Spark side will be the best compromise.
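A minimal sketch of what those naive timing tests look like in-process (the payload and sizes are illustrative, not the real tables):

```python
import bz2
import gzip
import time

# Repetitive genomic-style payload; long runs of repeated symbols are roughly
# the worst case for bzip2's sorting phase, per its manual.
payload = b"ACGTACGTACGTNNNN" * 200_000

def timed(compressor, data):
    """Return (wall seconds, compressed size) for one compression call."""
    t0 = time.perf_counter()
    out = compressor(data)
    return time.perf_counter() - t0, len(out)

gz_time, gz_size = timed(gzip.compress, payload)
bz_time, bz_size = timed(bz2.compress, payload)
print(f"gzip : {gz_time:.2f}s, {gz_size} bytes")
print(f"bzip2: {bz_time:.2f}s, {bz_size} bytes")
```

On a toy payload like this the absolute numbers are meaningless; the point is the shape of the test, which the shell `time` runs on the real tables mirror.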
The genome table example for `bzip2` has been running for significantly longer than `gzip` now. From the bzip2 manual (below), it would seem that the genomic strings are quite likely the worst-case input for compression.

> The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1.
My suggestion is that we continue to `gzip` the genomic table for transfer to PHE. Even though the PHE ingest will be unsplit, it remains reasonably fast and stable (the table grows linearly). We save precious time and I/O from having the table compressed at source this way.
I'll do some variant table tests when I get the final wall time of the bzip2 test.
Genome table takes 22m to process and `gzip`, 112m to process and `bzip2`. Will try the variant table now.

79m to process and `gzip` the variant table, 88m to process and `bzip2`. The 10m delay on this side is certainly worth the penalty, given there is an order of magnitude (or so) difference in processing the variant table on the other side as a splittable format (or not). Will discuss with CG.
Closing due to lack of interest