gamcil / clinker

Gene cluster comparison figure generator
MIT License
518 stars 69 forks source link

Implement parallel computing of alignments #22

Closed althonos closed 3 years ago

althonos commented 3 years ago

Hi Cameron,

first of all a big thank for this toolkit, building and rendering alignments between our biosynthetic genes clusters have never been that easy, so I felt like I had to give back by contributing a little something :smile:

Since the alignment time greatly increases with each new cluster to align, I added support for parallel computing of the alignment step. To do so, I removed the aligner from Globaligner, instead just storing the aligner configuration, and then creating a new one each time Globaligner.align_clusters is called. This is actually not affecting performance (the initialisation code for Bio.Align.Aligner is not doing expensive operations).

Because Bio.Align is not releasing the GIL, the only way to have true parallelism is to use processes instead of threads. This means the objects must be picklable, which is the case. However, to avoid pickling the entirety of the Globaligner, I moved the inner code of the Globaligner.align_clusters function into a static method that does not reference the aligner itself; this way, the function can be passed to a process by just pickling the aligner configuration, and not the Globaligner itself.

Finally, I added a -j / --jobs flag to the argument parser, so that the number of processes can be controlled from the CLI. Passing 0 will use the CPU count, passing any positive integers will use that number of processes instead.

On my computer (i7-8550U CPU @ 1.80GHz, with 8 CPUs, 4 cores), aligning the GenBank files in the examples folder was about 3.7 times faster:

Benchmark #1: python -m clinker.main examples/*.gbk
  Time (mean ± σ):      4.135 s ±  0.207 s    [User: 22.251 s, System: 1.138 s]
  Range (min … max):    3.744 s …  4.490 s    10 runs

Benchmark #2: clinker examples/*.gbk
  Time (mean ± σ):     15.210 s ±  0.432 s    [User: 15.261 s, System: 0.707 s]
  Range (min … max):   14.425 s … 15.970 s    10 runs

Summary
  'python -m clinker.main examples/*.gbk' ran
    3.68 ± 0.21 times faster than 'clinker examples/*.gbk'

Don't hesitate to ping me if you have any question!

gamcil commented 3 years ago

This is awesome, thanks! Should partially address #21 too.

Just one thing from my testing - it looks like BioPython's subsitution_matrices.Array object loses its _alphabet property during pickling. It still gets loaded by the aligner within each process and doesn't cause alignments to fail, but causes cluster alignments to be incomplete (gene alignments have really low scores and aren't saved). For example:

import pickle
from Bio.Align import substitution_matrices
matrix = substitution_matrices.load("BLOSUM62")
pickled = pickle.dumps(matrix)
loaded = pickle.loads(pickled)
print(loaded.alphabet)

Causes:

AttributeError: 'Array' object has no attribute '_alphabet'

This is fixed if I load the matrix within Globaligner._align_clusters with substitution_matrices.load("BLOSUM62"). Can you confirm on your end? If yes, maybe it's best to just save the subsitution matrix name as a string within Globaligner.aligner_config, and then add a specific check for the key to load a matrix in Globaligner._align_clusters.

althonos commented 3 years ago

Good catch ! I'll try to find a way to fix it. Maybe it's an issue that should be reported to Biopython devs as well, since the Array class is still experimental.

gamcil commented 3 years ago

Cool, thanks for filing the bug report. I was still getting some weird behaviour in alignments with your workaround, so I added a commit to just keep a matrix name in the config and load the substitution matrix directly within _align_clusters instead. This means we're restricted to only BioPython pre-made matrices, but they have a pretty extensive list of them so it should be fine. I might revisit this when https://github.com/biopython/biopython/pull/3442 lands in a new BioPython release. Thanks a lot for this PR!

althonos commented 3 years ago

Cool! I wanted to keep the configuration exactly as it is for the Biopython aligner, but if it doesn't work your approach is the safest. Cheers :smiley: