COG-UK / dipi-group

Data integrity and pipeline integration working group

[asklepian] Compress outputs #37

Closed SamStudio8 closed 3 years ago

SamStudio8 commented 3 years ago

Checking whether collection_pillar is strictly necessary. I think it may be acting as a proxy here for something else, ideally that information would come from PHA as it's more likely to be correct.

NG confirms this is the field they want. Will proceed to that spec.

SamStudio8 commented 3 years ago

make_genomes_table.py is responsible for pulling the metadata from the core metadata file and zipping it to the genomes. adm1 and collection_pillar are trivial; however, published_date is not in scope. Ideally we'd achieve this request with a Majora dataview, but they are still too slow at whole-dataset scale (https://github.com/SamStudio8/majora/issues/27).

Given the amount of work required to address the performance constraints of Majora's MDV API endpoint, we'll need to tide this over with something. The easiest solution will be to pull all pairs of published_name and published_date and mix these into the genome table. Long term, everything would ideally move to a faster version of the MDV endpoint.
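A minimal sketch of that stop-gap mixing step (function and column names here are assumptions for illustration, not the actual make_genomes_table.py code):

```python
# Hypothetical sketch of the stop-gap join: mix published_date into the
# genome table rows by joining on published_name.
# Column names are assumptions, not the real schema.
def add_published_dates(genome_rows, date_pairs):
    """date_pairs: iterable of (published_name, published_date) tuples."""
    dates = dict(date_pairs)
    for row in genome_rows:
        # Fall back to an empty field if a name has no published_date yet
        row["published_date"] = dates.get(row["published_name"], "")
        yield row

# Toy data standing in for the core metadata and the pulled pairs
rows = [{"published_name": "ENG/EXAMPLE/2021", "adm1": "UK-ENG", "collection_pillar": "2"}]
pairs = [("ENG/EXAMPLE/2021", "2021-03-01")]
out = list(add_published_dates(rows, pairs))
```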

SamStudio8 commented 3 years ago

As it happens, the get pag endpoint used to kick off Asklepian should have all the metadata in scope -- this would be a good stepping stone towards my ideal solution: we'd cut out the core metadata table and leverage the API instead. In a parallel universe where I have time to re-architect the MDV API, it would be quite easy to switch get pag over to use it.

SamStudio8 commented 3 years ago

Suspicion confirmed, the get pag API has everything we need.

SamStudio8 commented 3 years ago

https://github.com/SamStudio8/asklepian/commit/db3387b10b13ccd5b7a94969d7cbdbc6ef3b616b adds an updated genome table script that will push a test_v2 copy of the genome table until we are ready to switch over.

SamStudio8 commented 3 years ago

Changes deployed ready for tomorrow's Asklepian.

SamStudio8 commented 3 years ago

As per discussion with DG, compressing v2 genomics table as of today https://github.com/COG-UK/dipi-group/issues/43

SamStudio8 commented 3 years ago

CJ's team has picked this up now. Hopefully we can make the switch soon.

SamStudio8 commented 3 years ago

https://github.com/SamStudio8/asklepian/commit/71ca4555f867e71b4957f4b33047358cdb9e9672 deprecates the v1 genome table and removes the test_ prefix from the v2 table. The v1 genome table will not be generated as of 2021-04-21.

The v2 genomes table will not be automatically deleted (as usual) in case we need to resend or drop the columns to create the v1 table for whatever reason. Once we're happy we can return to deleting it as usual.

SamStudio8 commented 3 years ago

Ingest failed on the other side, engineers investigating. See JIRA ~~EDGE-2004~~, ~~EDGE-2152~~ DA-7013.

SamStudio8 commented 3 years ago

Chased this up with the engineers on the other side. It appears the compression was not taken into account? Regardless, the issue with ingest appears to be resolved now.

SamStudio8 commented 3 years ago

NG confirms the genomes table ingested has the Sequence field as MSA (#61) so the ingest must be the latest data, hooray! :tada: :parrot:

SamStudio8 commented 3 years ago

Going to chase compression on the variant table up this week to try and close this.

SamStudio8 commented 3 years ago

Moving this to backlog #62 as the change process on the other side is moving so slowly.

SamStudio8 commented 3 years ago

Discussed this with CG and have agreed to compress the variant table starting with tomorrow's run (20210604). I will add the gzip step to the Asklepian go.sh after today's (20210603) run has completed and notify CG. CG will update the ingest pipeline on their end to expect a gzipped input (like the genome table) after the 20210603 ingest completes later today. We will monitor the pipeline closely tomorrow to ensure continuity.
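The gzip step amounts to compressing the finished CSV before transfer; a minimal stdlib sketch (filenames and paths are illustrative, not the actual go.sh step):

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_table(csv_path):
    """Compress a finished table, producing <name>.gz alongside it."""
    gz_path = Path(str(csv_path) + ".gz")
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

# Demo with a throwaway file standing in for the real variant table CSV
tmp = Path(tempfile.mkdtemp()) / "variant_table_20210604.csv"
tmp.write_bytes(b"header1,header2\nA,B\n" * 1000)
gz = gzip_table(tmp)
```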

SamStudio8 commented 3 years ago

Change implemented by https://github.com/SamStudio8/asklepian/commit/064817ed695237ae3dc63c026f345f535ba96038. The output filename will now be suffixed with `.gz`: `variant_table_$DATESTAMP.csv.gz`. CG notified and acknowledged.

SamStudio8 commented 3 years ago

CG confirms partner change has been performed on their side. Green light for tomorrow :rocket:

SamStudio8 commented 3 years ago

Compressed variant table written and sent. Variant table step was around 15 minutes faster compared to yesterday, and the compression ratio in the new CSV is around 7x.

SamStudio8 commented 3 years ago

Reinflated CSV is where it is supposed to be on CLIMB-COVID, downstream asklepian-db step has run successfully. Spoken to CG on the other end and the gzipped variant table is processing on the other side! :fire: :rocket:

SamStudio8 commented 3 years ago

So apparently gzip files and Apache Spark are not friends (http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=RGkJacJAW3tTOQQA@mail.gmail.com%3E) and this change has caused performance trouble on the other side, as Spark is not able to split up the input for efficiency. That link mentions Snappy-compressed files are splittable, so we can try that.

SamStudio8 commented 3 years ago

We're rolling this change back for the weekend and will experiment with Snappy compression (or alternatives) next week.

SamStudio8 commented 3 years ago

Reverted our side by https://github.com/SamStudio8/asklepian/commit/e4511de3267c337b3840e1f88b6f95ed59b0d3a9

SamStudio8 commented 3 years ago

Reverted by CG on the other side

SamStudio8 commented 3 years ago

Installed python-snappy as it provides a module that binds the snappy library for easy use from the CLI, because obviously snappy is so hipster they can't possibly just distribute a binary. Sent over test_variant_table_20210604.csv.snappy just to try out.

SamStudio8 commented 3 years ago

As a quick sanity check, the `cat unsnappy | python -m snappy -c > snappy` followed by `python -m snappy -d snappy > unsnappy` round trip does give us the same file back.
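The same round-trip check can be sketched programmatically; stdlib gzip is used here as a stand-in since python-snappy may not be installed everywhere (this illustrates the compress/decompress/compare pattern, not the snappy codec itself):

```python
import gzip
import hashlib

# Round-trip check pattern: compress, decompress, compare digests.
# gzip stands in for the snappy CLI; the payload is toy data.
original = b"published_name,published_date\nENG/EXAMPLE/2021,2021-03-01\n"
roundtrip = gzip.decompress(gzip.compress(original))

# A lossless codec must give back a byte-identical file
assert hashlib.sha256(roundtrip).hexdigest() == hashlib.sha256(original).hexdigest()
```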

SamStudio8 commented 3 years ago

Naturally that file did not work, because of course there are several poorly documented codecs for snappy. Sent a replacement generated with `-t hadoop_snappy`, which is more likely to work according to some stranger on Stack Overflow.

SamStudio8 commented 3 years ago

CG reports the hadoop_snappy file was not splittable either. This SO article (https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable) conflicts with yesterday's reading and says that whole files compressed with Snappy won't be splittable after all. Given this was supposed to be a sticky plaster until we could implement the incremental tables, I don't want to spend too long delving into wtf is going on here, and I didn't really like the half-finished look of Snappy anyway.

Interestingly, another SO answer (https://stackoverflow.com/a/25888475/2576437) mentions that bzip2 and LZ4 (via https://github.com/fingltd/4mc) are supposed to be splittable, and those are totally normal compression algorithms.

SamStudio8 commented 3 years ago

CG confirms bzip2 is splittable :tada: Problematically, it also seems to be the slowest compression option we've tried. SN will do a couple of naive time tests to see what the impact of swapping to bzip2 would be. It may be that a small compression-time penalty on the CLIMB side to speed up the PHA Spark side will be the best compromise.
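A naive timing test along these lines can be done entirely with the stdlib codecs (the payload below is a toy stand-in; the real tables are tens of GB of CSV, so only the relative shape of the numbers transfers):

```python
import bz2
import gzip
import time

def time_compress(data, compress):
    """Return (seconds elapsed, compressed size) for one compressor."""
    t0 = time.perf_counter()
    out = compress(data)
    return time.perf_counter() - t0, len(out)

# Toy CSV-ish payload standing in for a table chunk
data = b"COG-UK/EXAMPLE,ACGTACGTNNNN\n" * 100000

t_gz, n_gz = time_compress(data, lambda d: gzip.compress(d, 6))
t_bz, n_bz = time_compress(data, lambda d: bz2.compress(d, 9))
```

Comparing `(t_gz, n_gz)` against `(t_bz, n_bz)` across representative inputs is the whole test: seconds spent on the CLIMB side versus bytes (and splittability) gained on the Spark side.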

SamStudio8 commented 3 years ago

The genome table example for bzip2 has been running for significantly longer than gzip now. From the bzip2 manual (below), it would seem that the genomic strings are quite likely the worst-case input for compression.

> The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1.
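The manual's worst case is easy to reproduce on toy data with the stdlib codecs (note this only demonstrates that both codecs still compress such runs extremely well; the pain bzip2 takes is in wall time, not ratio):

```python
import bz2
import zlib

# Toy version of the manual's worst case: long runs of a repeated motif,
# similar in spirit to genomic strings. Real tables are vastly larger.
data = b"aabaabaabaab" * 10000

gz = zlib.compress(data, 9)
bz = bz2.compress(data, 9)

ratio_gz = len(data) / len(gz)
ratio_bz = len(data) / len(bz)
```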

My suggestion is that we continue to gzip the genome table for transfer to PHE. Even though the PHE ingest will be unsplit, it remains reasonably fast and stable (the table grows linearly), and we save precious time and I/O by compressing the table at source.

I'll do some variant table tests when I get the final wall time of the bzip2 test.

SamStudio8 commented 3 years ago

The genome table takes 22m to process and gzip, 112m to process and bzip2. Will try the variant table now.

SamStudio8 commented 3 years ago

79m to process and gzip the variant table, 88m to process and bzip2. The 10m penalty on this side is certainly worth it, given there is an order-of-magnitude (or so) difference in processing the variant table on the other side depending on whether the format is splittable. Will discuss with CG.

SamStudio8 commented 3 years ago

Closing due to lack of interest