Closed: H-EKE closed this issue 3 months ago
Hi @H-EKE,
Currently, pyKVFinder offers thread-level parallelization for a single detection. To improve efficiency for large datasets, you could use the multiprocessing package, which allows you to run multiple detections simultaneously. While I have not tried this myself, I recommend balancing the number of threads (`nthreads`) in the pyKVFinder package with the number of processes in the multiprocessing package to optimize performance.
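As a rough rule of thumb (an assumption on my part, not a benchmarked recommendation), keeping processes × threads at or below the number of available cores avoids oversubscription:

```python
import os

# Total cores available on this machine
total_cores = os.cpu_count() or 1

# Assumption: fix the number of worker processes, then give each
# pyKVFinder call an equal share of the cores as nthreads.
n_processes = 4
n_threads = max(1, total_cores // n_processes)
print(f"{n_processes} processes x {n_threads} threads each")
```

The best split depends on your machine and grid size, so it is worth benchmarking a few combinations on a handful of frames first.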
Another way to reduce computational time is by adjusting the grid size for detection. You can use the Box Adjustment Mode to draw a box around the target binding site, which will help in focusing the detection and speeding up the process.
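For reference, the custom box file used in box mode is a TOML file listing the four box vertices; a sketch with placeholder coordinates (the `p1`..`p4` layout follows parKVFinder's box format, as in the code later in this thread):

```toml
[box]
p1 = [0.0, 0.0, 0.0]   # origin vertex
p2 = [20.0, 0.0, 0.0]  # vertex along the x axis
p3 = [0.0, 20.0, 0.0]  # vertex along the y axis
p4 = [0.0, 0.0, 20.0]  # vertex along the z axis
```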
Could you please share the parameters you are using, such as Step, Probe Out, Removal Distance, and Grid size (nx, ny, nz)? Also, how many atoms are in your biomolecule?
For reference, here is a Jupyter notebook demonstrating cavity detection in an MD simulation, using a custom box for grid creation: https://github.com/LBC-LNBio/pyKVFinder/blob/master/examples/md-analysis/md-analysis.ipynb.
Hi @jvsguerra,
Thanks for the fast answer.
This is the code that I'm using:
```python
import os

import pyKVFinder
import toml

# Get all frames of the molecular dynamics simulation
frames = [f for f in sorted(os.listdir('/test_1')) if f.endswith('.pdb')][:-2]
print(frames)

# Get reference protein
reference = 'reference.pdb'

# Define detection parameters
step = 0.4
probe_out = 12.0

# Get vertices of the whole protein
atomic = pyKVFinder.read_pdb(reference)
vertices = pyKVFinder.get_vertices(atomic, probe_out=probe_out, step=step)

# Define a common custom box using parKVFinder's PyMOL plugin
box = {
    'box': {
        'p1': vertices[0].tolist(),
        'p2': vertices[1].tolist(),
        'p3': vertices[2].tolist(),
        'p4': vertices[3].tolist()
    }
}

# Write common custom box to file
with open('/box.toml', 'w') as f:
    toml.dump(box, f)

# Detect and characterize cavities on reference
results = pyKVFinder.run_workflow(reference, box='box.toml', include_depth=True, include_hydropathy=True, ignore_backbone=False)

# Export cavities and results
results.export_all(fn='/test_1/linked.results.toml', output='/test_1/linked_output.pdb', include_frequencies_pdf=True, pdf='test_1/linked.histograms.pdf')

# Count, per grid point, in how many frames a cavity occurs
occurrence = None
for frame in frames:
    # Load atomic data
    atomic = pyKVFinder.read_pdb(os.path.join('/test_1', frame))
    # Get vertices from file
    vertices, atomic = pyKVFinder.get_vertices_from_file('box.toml', atomic, probe_out=12.0)
    # Detect biomolecular cavities
    ncav, cavities = pyKVFinder.detect(atomic, vertices, box_adjustment=True)
    if occurrence is None:
        occurrence = (cavities > 1).astype(int)
    else:
        occurrence += (cavities > 1).astype(int)

# Get percentage of occurrence
percentage = (occurrence / len(frames)) * 100

# Get cavity points (label 2 marks cavity points for export)
cavities = (occurrence > 0).astype('int32')
cavities += cavities

# Export cavities with percentage of occurrence in B-factor column
pyKVFinder.export('/test_1/occurrence.pdb', cavities, None, vertices, B=percentage)
```
To optimize performance, you could increase the `step` back to 0.6 and use the `multiprocessing` package.
Here is a simple example of how to use `multiprocessing` with pyKVFinder:
```python
import multiprocessing

import pyKVFinder

# Parameters for parallelization
NUMBER_OF_THREADS = 3
NUMBER_OF_PROCESSES = 4

pdbs = [f"examples/md-analysis/data/{index:03d}.pdb" for index in range(1, 601)]

def func(pdb):
    # Each worker process runs a full workflow with its own thread budget
    return pyKVFinder.run_workflow(pdb, step=0.6, probe_out=12.0, nthreads=NUMBER_OF_THREADS)

if __name__ == "__main__":
    # The context manager tears the pool down when the block exits
    with multiprocessing.Pool(NUMBER_OF_PROCESSES) as p:
        results = p.map(func, pdbs, chunksize=1)
```
This script sets up `multiprocessing` to run `pyKVFinder.run_workflow` on multiple PDB files in parallel, which can speed up the analysis process. It is worth noting that you will need to tune `NUMBER_OF_THREADS` and `NUMBER_OF_PROCESSES` to achieve the best performance possible.
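The same pool pattern could also drive the per-frame occurrence counting from earlier in the thread: each worker returns a 0/1 cavity mask for one frame, and the parent sums the masks. A minimal, self-contained sketch, where `detect_frame` is a hypothetical stand-in for the pyKVFinder detection call (it returns a flat dummy mask instead of a real grid):

```python
import multiprocessing

def detect_frame(frame_index):
    # Stand-in for pyKVFinder.detect on one frame: returns a flat
    # 0/1 cavity mask (dummy data; alternating points per frame).
    return [1 if (i + frame_index) % 2 == 0 else 0 for i in range(8)]

def occurrence_percentage(n_frames, n_processes=2):
    # Detect each frame in a separate worker process
    with multiprocessing.Pool(n_processes) as pool:
        masks = pool.map(detect_frame, range(n_frames))
    # Element-wise sum of per-frame masks, then convert to a percentage
    counts = [sum(point) for point in zip(*masks)]
    return [100.0 * c / n_frames for c in counts]

if __name__ == "__main__":
    print(occurrence_percentage(4))
```

With real grids, the element-wise sum would be a NumPy addition over the `(cavities > 1)` arrays, exactly as in the sequential loop above; only the per-frame detection moves into the pool.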
I will try and update you!
Thanks a lot @jvsguerra
Hi,
Thanks for your nice tool!
I was wondering if there is a way to efficiently parallelize the analysis. I would like to run the analysis on 10,000 frames, but it is currently taking a lot of time.
Thanks!