COG-UK / dipi-group

Data integrity and pipeline integration working group

Asklepian performance #195

Closed SamStudio8 closed 2 years ago

SamStudio8 commented 2 years ago

Once again, we must consider a patch to Asklepian to improve performance. Since introducing gofasta several months ago (#117) "for great gains", the linear performance in generating the genome and variant tables has slowly but surely grown unwieldy. Previous discussions on using delta tables for EDGE went cold as the engineers we were closely working with were moved off the project.

It seems clear to me now that there is no likelihood of implementing delta tables to send to EDGE, and we are left with two options: perform deltas internally, or throw more compute at the problem.

I have previously thought about how we could perform internal deltas and laid some initial ground work for this in a previous update. The "best ref" step emits whether or not the best reference for a COG-ID has changed since the last run of Asklepian; providing an easy means to determine whether new work should be done for a given COG. However, from experience we know that these internal caches introduce more moving parts to look after and we are trying to limit the cost of maintaining the system going forward.
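For illustration only, the internal delta idea boils down to comparing the best reference recorded for each COG-ID on the previous run against the current one. This is a minimal sketch, not Asklepian's actual code: the `changed_cogs` function and the dictionary layout are hypothetical, and the real "best ref" step may emit its change flag in a different shape.

```python
def changed_cogs(previous, current):
    """Return COG-IDs whose best reference is new or differs from the last run.

    Both arguments are hypothetical {cog_id: best_reference} mappings.
    """
    return sorted(
        cog_id
        for cog_id, best_ref in current.items()
        if previous.get(cog_id) != best_ref
    )


previous = {"COG-1": "refA", "COG-2": "refB"}
current = {"COG-1": "refA", "COG-2": "refC", "COG-3": "refD"}

# Only COG-2 (reference changed) and COG-3 (new) would need fresh work.
print(changed_cogs(previous, current))  # ['COG-2', 'COG-3']
```

The catch noted above still applies: `previous` has to be persisted between runs, which is exactly the kind of cache that adds maintenance cost.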

Having not exhausted the potential of throwing more compute cycles at the problem, it seems obvious to try and see what gains can be made with multiprocessing. Watching Asklepian this afternoon, it is clear from the speed at which the table files grow on disk that we are limited by CPU, not by cephfs IO.
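As a rough sketch of what a CPU-bound table build looks like when fanned out over worker processes, assuming each genome can be processed independently. The `variant_rows` and `build_variant_table` names and the row layout are made up for illustration; this is not the code from the Asklepian patch.

```python
from multiprocessing import Pool


def variant_rows(record):
    """Hypothetical per-genome work: one row per base that differs from the reference."""
    cog_id, reference, sequence = record
    return [
        (cog_id, pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sequence), start=1)
        if ref != alt
    ]


def build_variant_table(records, processes=4, chunksize=64):
    """Collect rows for every genome, in input order."""
    if processes == 1:
        # Serial path, handy for debugging and for comparing timings.
        per_genome = map(variant_rows, records)
        return [row for rows in per_genome for row in rows]
    with Pool(processes=processes) as pool:
        # imap yields results in input order, so the table stays deterministic.
        per_genome = pool.imap(variant_rows, records, chunksize=chunksize)
        return [row for rows in per_genome for row in rows]


if __name__ == "__main__":
    reference = "ACGTACGT"
    records = [
        ("COG-1", reference, "ACGTACGA"),
        ("COG-2", reference, "TCGTACGT"),
    ]
    for row in build_variant_table(records, processes=2):
        print(*row, sep="\t")
```

This only helps while the workers are the bottleneck. Once the rows are produced faster than they can be compressed or uploaded, adding processes stops paying off.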

SamStudio8 commented 2 years ago

HE'S DONE IT AGAIN :rocket: https://github.com/CLIMB-COVID/asklepian/commit/ab5faffd564512c357484122103a8d57a9471134

SamStudio8 commented 2 years ago

unfortunately this patch is so fast it is now quicker than the genome table, so we'll have to just speed that up as well

SamStudio8 commented 2 years ago

wrt the genome table: Amusingly, way back as part of #37, we started compressing the genome table as it took a "considerable" amount of time to transfer it to EDGE in comparison to generating the table in the first place. A year later, and now the size of the data set has shifted the bottleneck to the compression.

We are somewhat stuck with this one. We won't be able to liaise with EDGE to rectify this nicely as the expertise to do it is no longer in the project. We cannot switch the compression algorithm, nor use something like pigz (as it is potentially unsupported by the endpoint). Our only real option is to forgo compression performance in favour of speed with gzip --fast.

It takes just over 8 minutes to write the genome table uncompressed for the Elan 20220228 data set. Feeding it to gzip increases the table processing time to nearly two hours! However, using gzip --fast will construct the genome table in 40m but doubles the output file size...
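To make the trade-off concrete, here is a small sketch that writes the same table at two compression levels. `gzip --fast` is the same as `gzip -1`, and the gzip CLI defaults to level 6. The `write_table` helper and the toy rows are invented for the example, and Python's zlib-backed `gzip` module will not reproduce the sizes or timings of the real pipeline.

```python
import gzip
import io


def write_table(rows, compresslevel):
    """Serialise rows as gzip-compressed TSV and return the compressed bytes."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", compresslevel=compresslevel) as gz:
        for row in rows:
            gz.write(("\t".join(row) + "\n").encode("utf-8"))
    return buffer.getvalue()


# Toy stand-in for the genome table: (COG-ID, sequence) pairs.
rows = [(f"COG-{i}", "ACGT" * 50) for i in range(1000)]

fast = write_table(rows, compresslevel=1)     # equivalent level to `gzip --fast`
default = write_table(rows, compresslevel=6)  # the gzip CLI default level

# Either way the receiver decompresses to identical bytes with plain gzip.
assert gzip.decompress(fast) == gzip.decompress(default)
print(len(fast), len(default))
```

Both outputs are ordinary gzip streams, which is why this change needs nothing from the endpoint, unlike switching algorithm.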

Today's genome table was constructed in 1h55m and uploaded in 14m for a total of 129 minutes. The gzip --fast table was constructed in 40m and uploaded in 32m for a total of 72 minutes. Using gzip --fast is the best trade-off we will be able to achieve in the circumstances. I anticipate that the compression will remain the larger factor given the size of the data set now. The 20220228 variant table is twice as large as the --fast genome table and still transfers in an hour. A change in compression level is not expected to change the decompression time on the EDGE side (https://stackoverflow.com/questions/28452429/does-gzip-compression-level-have-any-impact-on-decompression).

Even though this may not be the ideal solution, it is trivial to implement and will ensure that the variant table remains the Asklepian bottleneck, as Asklepian's run time is the maximum of the genome table and variant table times. The variant table after the most recent update still takes around 90 minutes to construct and upload (uploading is now the bottleneck again in the case of the variant table). This change will buy more than enough time for the time being, at least.
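The arithmetic behind that claim, spelled out with the figures quoted in this thread (all in minutes). The variable names are just for the example, and the variant table is held at its post-patch figure of roughly 90 minutes in both comparisons.

```python
# Genome table with default gzip: 1h55m to construct, 14m to upload.
genome_default = 115 + 14
# Genome table with gzip --fast: 40m to construct, 32m to upload.
genome_fast = 40 + 32
# Variant table after the multiprocessing patch: around 90m to construct and upload.
variant = 90

print(genome_default)                # 129
print(genome_fast)                   # 72
# The two tables are built side by side, so the slower one sets the run time.
print(max(genome_default, variant))  # 129: genome table is the bottleneck
print(max(genome_fast, variant))     # 90: variant table is the bottleneck again
```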

The ideal solution in this case would be to send only genomes that have changed to EDGE as previously discussed, but that is ~unlikely~ almost certainly not a viable option now.

SamStudio8 commented 2 years ago

Asklepian now finishes in around half the time it took before, which is a pretty decent win in my book. The remainder of the time is split roughly evenly between waiting for the Majora manifest, building the MSA, and building/uploading tables. The only speed gain left now would come from a proper implementation of deltas.