CATH-summer-2017 / domchop


Parallelisation #7

Open shouldsee opened 7 years ago

shouldsee commented 7 years ago

test_pdb.py currently uses only one thread. I am looking to parallelise it to speed things up on a multi-core machine. The same routine should also be useful for other compute-intensive tasks in the future.

For example, it took 31m38.676s (1898 s) to calculate nDOPE for 3393 PDB structures. The S35 set has 21155 structures, so running over the whole set would take about 11834 s (3.29 hrs). Parallelising over 6 cores could reduce this to about 0.55 hrs, assuming near-linear scaling.
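The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the rounded 1898 s figure quoted above and assumes perfect linear scaling across cores, which real runs are unlikely to reach. The variable names are illustrative and do not come from the repository.

```python
# Figures quoted in this issue.
measured_seconds = 1898        # 31m38.676s, rounded as in the text
measured_structures = 3393     # structures scored in the timed run
s35_structures = 21155         # size of the S35 set
cores = 6                      # assumption: ideal linear speed-up

# Average cost per structure in the timed run.
seconds_per_structure = measured_seconds / measured_structures

# Extrapolate to the full S35 set on a single core.
serial_seconds = seconds_per_structure * s35_structures
serial_hours = serial_seconds / 3600

# Best case when the work is split evenly over all cores.
parallel_hours = serial_hours / cores

print("per structure: %.3f s" % seconds_per_structure)
print("serial:        %.0f s (%.2f hrs)" % (serial_seconds, serial_hours))
print("on %d cores:    %.2f hrs" % (cores, parallel_hours))
```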

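As suggested by Tony, I profiled the code to check that Modeller is indeed taking up most of the time. In the line_profiler output below, the `get_nDOPE` call (line 24) accounts for 99.9% of the run time.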

Timer unit: 1e-06 s

Total time: 12.4703 s
File: <ipython-input-27-19c264d8bf47>
Function: main at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def main():
     3         1            2      2.0      0.0          wait = 0;
     4         1            1      1.0      0.0          waitname = "3p9dG03";
     5         1            1      1.0      0.0          reset = 0;
     6                                           
     7         1            1      1.0      0.0          if reset:
     8                                                       open("ref_DOPEs.csv","w").close()
     9                                           
    10         1            4      4.0      0.0          import csv
    11                                           
    12         1           46     46.0      0.0          with open("ref_DOPEs.csv", "a") as f:
    13         1           12     12.0      0.0              c = csv.writer(f)
    14                                           
    15        14           53      3.8      0.0              for pdbfile in onlyfiles:
    16        13          116      8.9      0.0                  pdbname = os.path.basename(pdbfile);
    17        13          101      7.8      0.0                  if pdbfile.split(".")[-1] in ["bak"] or wait:
    18                                                               # onlyfiles.pop(pdbfile);
    19         1            1      1.0      0.0                      if pdbname == waitname:
    20                                                                   wait = 0;
    21                                                               continue
    22        12         8685    723.8      0.1                  print("\n\n//Testing structure from %s" % pdbfile)
    23        12           21      1.8      0.0                  try:
    24        12     12459657 1038304.8     99.9                      nDOPE = get_nDOPE( join(pdbfile), env = env)
    25         1            3      3.0      0.0                  except:
    26         1          146    146.0      0.0                      print("can't process", pdbfile)
    27                                           #                     break
    28                                                               
    29        12           46      3.8      0.0                  nDOPEs.append( nDOPE );
    30        12           16      1.3      0.0                  tested_files.append( pdbname );
    31        12          328     27.3      0.0                  c.writerow( [pdbname, nDOPE] )
    32        12         1046     87.2      0.0                  f.flush()
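
Since almost all the time is spent inside `get_nDOPE`, one way to parallelise the loop is to hand each file to a worker process and keep the CSV writing in the parent. The following is a minimal sketch, assuming Python 3 and a picklable, module-level scoring function. `score_one`, `score_all` and the `pool_cls` parameter are hypothetical names that are not in test_pdb.py. How each worker should set up its own Modeller `env` is not covered here and would need checking.

```python
import csv
import os
from multiprocessing import Pool


def score_one(job):
    """Score one file; return (name, score), or (name, None) on failure."""
    score_fn, pdbfile = job
    name = os.path.basename(pdbfile)
    try:
        return name, score_fn(pdbfile)
    except Exception:
        # Report the failure instead of reusing a stale value.
        return name, None


def score_all(pdbfiles, score_fn, out_csv, workers=6, pool_cls=Pool):
    """Score files in parallel, appending rows to out_csv as they finish.

    Returns the names of the files that could not be processed.
    """
    # Same filter as the serial loop: skip backup files.
    jobs = [(score_fn, p) for p in pdbfiles if p.split(".")[-1] != "bak"]
    failed = []
    with open(out_csv, "a", newline="") as f, pool_cls(workers) as pool:
        writer = csv.writer(f)
        # Results arrive in completion order, not input order.
        for name, score in pool.imap_unordered(score_one, jobs):
            if score is None:
                print("can't process", name)
                failed.append(name)
                continue
            writer.writerow([name, score])
            f.flush()
    return failed


# Hypothetical usage, where my_get_nDOPE is a module-level wrapper that
# creates its own Modeller environment and calls get_nDOPE:
#   failed = score_all(onlyfiles, my_get_nDOPE, "ref_DOPEs.csv", workers=6)
```

Only the parent process writes to the file, so rows cannot interleave, and a failed structure is recorded separately instead of being written with the previous structure's value. Because results come back in completion order, the rows in `ref_DOPEs.csv` will not follow the order of `onlyfiles`.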