LBC-LNBio / pyKVFinder

pyKVFinder: Python-C parallel KVFinder
https://lbc-lnbio.github.io/pyKVFinder/
GNU General Public License v3.0

Parallelize jobs #116

Closed H-EKE closed 3 months ago

H-EKE commented 3 months ago

Hi,

Thanks for your nice tool!

I was wondering if there is a way to efficiently parallelize the analysis. I would like to run the analysis on 10,000 frames, but it is currently taking a lot of time.

Thanks!

jvsguerra commented 3 months ago

Hi @H-EKE,

Currently, pyKVFinder offers thread-level parallelization for a single detection. To improve efficiency for large datasets, you could use the multiprocessing package, which allows you to run multiple detections simultaneously. While I have not tried this myself, I recommend balancing the number of threads (nthreads) in the pyKVFinder package with the number of processes in the multiprocessing package to optimize performance.

Another way to reduce computational time is by adjusting the grid size for detection. You can use the Box Adjustment Mode to draw a box around the target binding site, which will help in focusing the detection and speeding up the process.
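For reference, a custom box file for Box Adjustment Mode is a TOML file with the four box vertices, matching the `{'box': {'p1': ..., 'p4': ...}}` structure that `toml.dump` produces (the coordinates below are placeholders, not values for your system):

```toml
[box]
p1 = [0.0, 0.0, 0.0]     # box origin
p2 = [20.0, 0.0, 0.0]    # vertex along the box's x edge
p3 = [0.0, 20.0, 0.0]    # vertex along the box's y edge
p4 = [0.0, 0.0, 20.0]    # vertex along the box's z edge
```

A smaller box means fewer grid points per frame, which directly reduces detection time.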

Could you please share the parameters you are using, such as Step, Probe Out, Removal Distance, and Grid size (nx, ny, nz)? Also, how many atoms are in your biomolecule?

For reference, here is a Jupyter notebook demonstrating cavity detection in an MD simulation, using a custom box for grid creation: https://github.com/LBC-LNBio/pyKVFinder/blob/master/examples/md-analysis/md-analysis.ipynb.

H-EKE commented 3 months ago

Hi @jvsguerra ,

Thanks for the fast answer.

This is the code that I'm using:

import os

import pyKVFinder
import toml

# Get all frames of the molecular dynamics simulation
frames = [f for f in sorted(os.listdir('/test_1')) if f.endswith('.pdb')][:-2]
print(frames)
# Get reference protein
reference = 'reference.pdb'
# Define detection parameters
step = 0.4
probe_out = 12.0

# Get vertices of the whole protein
atomic = pyKVFinder.read_pdb(reference)
vertices = pyKVFinder.get_vertices(atomic, probe_out=probe_out, step=step)

# Define a common custom box using parKVFinder's PyMOL plugin
box = {
    'box': {
        'p1': vertices[0].tolist(),
        'p2': vertices[1].tolist(),
        'p3': vertices[2].tolist(),
        'p4': vertices[3].tolist()
    }
}

# Write common custom box to file
with open('box.toml', 'w') as f:
    toml.dump(box, f)

# Detect and characterize cavities on reference
results = pyKVFinder.run_workflow(reference, box='box.toml', include_depth=True, include_hydropathy=True, ignore_backbone=False)

# Export cavities and results
results.export_all(fn='/test_1/linked.results.toml', output='/test_1/linked_output.pdb', include_frequencies_pdf=True, pdf='/test_1/linked.histograms.pdf')

# Create empty array
occurrence = None

for frame in frames:
    # Load atomic data
    atomic = pyKVFinder.read_pdb(os.path.join('/test_1', frame))

    # Get vertices from file
    vertices, atomic = pyKVFinder.get_vertices_from_file('box.toml', atomic, probe_out=12.0)

    # Detect biomolecular cavities
    ncav, cavities = pyKVFinder.detect(atomic, vertices, box_adjustment=True)

    if occurrence is None:
        occurrence = (cavities > 1).astype(int)
    else:
        occurrence += (cavities > 1).astype(int)

# Get percentage of occurrence
percentage = (occurrence/len(frames)) * 100

# Get cavity points (doubled so cavity points have the value >= 2
# that pyKVFinder uses to label cavities)
cavities = (occurrence > 0).astype('int32')
cavities += cavities

# Export cavities with percentage of occurrence in B-factor column
pyKVFinder.export('/test_1/occurrence.pdb', cavities, None, vertices, B=percentage) 
jvsguerra commented 3 months ago

To optimize performance, you could increase the step back to 0.6 and use the multiprocessing package.

Here is a simple example of how to use multiprocessing with pyKVFinder:

import pyKVFinder
import multiprocessing

# Parameters for parallelization
NUMBER_OF_THREADS = 3
NUMBER_OF_PROCESSES = 4

pdbs = [f"examples/md-analysis/data/{index:03d}.pdb" for index in range(1, 601)]

def func(pdb):
    return pyKVFinder.run_workflow(pdb, step=0.6, probe_out=12.0, nthreads=NUMBER_OF_THREADS)

with multiprocessing.Pool(NUMBER_OF_PROCESSES) as p:
    results = p.map(func, pdbs, chunksize=1)

This script sets up multiprocessing to run pyKVFinder.run_workflow on multiple PDB files in parallel, which can speed up the analysis considerably. It is worth noting that you will need to tune NUMBER_OF_THREADS and NUMBER_OF_PROCESSES to achieve the best performance possible.
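One way to balance the two knobs is to derive the process count from the machine's core count. The sketch below is only a suggestion (balance_workers and occurrence_from_masks are hypothetical helpers, not part of pyKVFinder); it also shows how per-frame boolean masks, such as cavities > 1 returned by each worker, could be reduced into a single occurrence grid at the end instead of inside a serial loop:

```python
import os

import numpy as np

# Hypothetical helper (not part of pyKVFinder): choose a process count
# so that processes * threads roughly matches the available CPU cores.
def balance_workers(threads_per_process, n_cores=None):
    n_cores = n_cores or os.cpu_count() or 1
    return max(1, n_cores // threads_per_process)

# Sum per-frame boolean cavity masks (e.g. cavities > 1 from each worker)
# into a single occurrence count across all frames.
def occurrence_from_masks(masks):
    return np.stack(masks).sum(axis=0)
```

For example, with 12 cores and nthreads=3, balance_workers(3, 12) suggests 4 processes; each worker can return (cavities > 1).astype(int) from Pool.map, and occurrence_from_masks reduces the list of masks in one step.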

H-EKE commented 3 months ago

I will try and update you!

Thanks a lot @jvsguerra