ccsb-scripps / AutoDock-GPU

AutoDock for GPUs and other accelerators
https://autodock.scripps.edu
GNU General Public License v2.0
408 stars 110 forks

Scripts to process the results of batch docking #115

Closed alchemist-coder closed 3 years ago

alchemist-coder commented 3 years ago

Hello,

I can't find information regarding the AutoDock-GPU batched ligand pipeline.

Are there example scripts available to summarize and sort the results of batch docking? Or generate the -filelist batch file from a tree of ligand-containing directories?

Should it still be possible to use the summarize_results4.py script provided in the AutoDock 4 VSTutorial? I noticed that the .dlg files resulting from screening with AutoDock 4 and from batch submission with AutoDock-GPU appear somewhat different.

Thanks

diogomart commented 3 years ago

Hello,

Are there example scripts available to summarize and sort the results of batch docking?

No, maybe in the future.

Or generate the -filelist batch file from a tree of ligand-containing directories?

Also no, maybe in the future.
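
In the meantime, a minimal sketch of how such a file could be generated is shown below. It assumes a filelist layout of the grid-map .fld file on the first line followed by one ligand .pdbqt path per line; the function name and all paths are hypothetical placeholders.

import pathlib

def write_filelist(ligand_root, fld_file, filelist_path):
    # Recursively collect every .pdbqt under the ligand directory tree and
    # write a list intended for AutoDock-GPU's -filelist option.
    ligands = sorted(pathlib.Path(ligand_root).rglob('*.pdbqt'))
    with open(filelist_path, 'w') as f:
        f.write(f'{fld_file}\n')        # grid maps (.fld) first
        for lig in ligands:
            f.write(f'{lig}\n')         # one ligand .pdbqt per line

# hypothetical usage
write_filelist('ligands/', 'receptor.maps.fld', 'batch_filelist.txt')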

Should it still be possible to use the summarize_results4.py script provided in the AutoDock 4 VSTutorial?

That's a good question. We haven't tested it. We'll get back here at some point. We still have work to do on the code; work on the accessory scripts will follow.

Best regards,

pgoverna commented 3 years ago

Hi, one possibility is to use the generate_vs_results_AD.py script in the /AutoDockTools/Utilities24/ folder provided with MGLTools. You just need to use the -A option, which disables the checkSuccess function. This is necessary because the .dlg files generated by AutoDock-GPU are slightly different from those generated by AutoDock 4, and the script won't work without the -A option.

Best regards

abazabaaa commented 3 years ago

I wrote a number of scripts to do these things. I can't write one that is completely general, as it depends on how you choose to organize your results and how you want to run your pipeline. I've used this to extract the results of a campaign that docked 20M+ ligands into a receptor. It is robust. I can try to explain if you have questions, but I'm not quite ready to release the whole workflow yet. The code snippets below have the necessary pieces for carrying out three tasks in Python:

1. Find the top-scoring pose in the XML file.
2. Extract the top-scoring pose from the DLG file.
3. Index the PDB you get after an Open Babel PDBQT->PDB conversion.


import json
import os
import pathlib
import xml.etree.ElementTree as ET


def prep_docking_results(run_dir, stage_dir, output_db_ext):
    dict_path = os.path.join(run_dir, f'{pathlib.Path(stage_dir).parent.stem}_dictionary.json') 
    with open(dict_path) as json_file: 
        data = json.load(json_file)

    for l in data['canonical_ids']:
        assigned_batch_key = (pathlib.Path(stage_dir).stem+'_batch')
        job_assigned_id = data['canonical_ids'][l]['job_assigned_id']
        output_dir = os.path.join(stage_dir, 'output')
        xml_path = os.path.join(output_dir, data['canonical_ids'][l][assigned_batch_key], f'{job_assigned_id}{output_db_ext}.xml')
        dlg_path = os.path.join(output_dir, data['canonical_ids'][l][assigned_batch_key], f'{job_assigned_id}{output_db_ext}.dlg')
        # print(dlg_path)
        try:
            tree = ET.parse(xml_path)
            root = tree.getroot()
            docking_score = [x.get('mean_binding_energy') for x in root.findall(".//*[@cluster_rank='1']")]
            top_docking_run = [y.get('run') for y in root.findall(".//*[@cluster_rank='1']")]
            # print(docking_score)
            # print(top_docking_run)
            data['canonical_ids'][l].update({'mean_binding_energy':docking_score, 'top_docking_run':top_docking_run, 'combined_docking_poses':dlg_path})
        except FileNotFoundError:
            print(f'There was an error with prep_docking_results')
            print(f'The target xml_path was: {xml_path}')
            print("Wrong file or file path")

    with open(dict_path, 'w', encoding='utf-8') as jsonf:
        jsonf.write(json.dumps(data, indent=4))


def extract_top_autodock_pose(run_dir, stage_dir):

    dict_path = os.path.join(run_dir, f'{pathlib.Path(stage_dir).parent.stem}_dictionary.json') 
    with open(dict_path) as json_file: 
        data = json.load(json_file)

    for l in data['canonical_ids']:
        try:
            file_name = pathlib.Path(data['canonical_ids'][l]['combined_docking_poses'])
            target_run_list = data['canonical_ids'][l]['top_docking_run']
            target_run = target_run_list[0]
            with open(file_name, "r") as dlg_file:
                lines = dlg_file.readlines()
            target_dir = str(file_name.parent)

            starting_line = 0
            ending_line = 0
            # note: the '/ 10' patterns below assume each ligand was docked with 10 runs
            for line in lines:
                if line.startswith(rf'Run:   {target_run} / 10'):
                    starting_line = lines.index(line)
                    break

            if target_run != 10:
                # print('the target was not 10')
                for line in lines[starting_line:]:
                    if line.startswith(rf'Run:   {(int(target_run) + 1)} / 10'):
                        ending_line = lines.index(line)
                        break
                pre_pdbqt_name_prefix = os.path.splitext(file_name.stem)
                pdbqt_name_prefix = pre_pdbqt_name_prefix[0] + ".pdbqt"
                pdbqt_name = os.path.join(target_dir, pdbqt_name_prefix)
                pdbqt_content = lines[(starting_line + 4):(ending_line - 9)]
                stripped_pdbqt_content = [line.rstrip('\n') for line in pdbqt_content]

            if target_run == 10:
                # print('the target was 10')
                for line in lines[starting_line:]:
                    if line.startswith('    CLUSTERING HISTOGRAM'):
                        ending_line = lines.index(line)
                        break
                pre_pdbqt_name_prefix = os.path.splitext(file_name.stem)
                pdbqt_name_prefix = pre_pdbqt_name_prefix[0] + ".pdbqt"
                pdbqt_name = os.path.join(target_dir, pdbqt_name_prefix)
                pdbqt_content = lines[(starting_line + 4):(ending_line - 5)]
                stripped_pdbqt_content = [line.rstrip('\n') for line in pdbqt_content]

            clean_pdbqt_content = []

            for line in stripped_pdbqt_content:
                cleaned_line = line[max(line.find('D'), 8):]
                if not cleaned_line.startswith('USER'):
                    clean_pdbqt_content.append(cleaned_line)

            with open(pdbqt_name, 'w') as f:
                for line_item in clean_pdbqt_content:
                    f.write("%s\n" % line_item)
        except KeyError:
            print(f'There was a key error for extract_top_autodock_pose')
            print(f'The target dlg path was : {file_name}')
            pass


def index_3d_pose(run_dir, stage_dir, output_db_ext):
    dict_path = os.path.join(run_dir, f'{pathlib.Path(stage_dir).parent.stem}_dictionary.json') 
    with open(dict_path) as json_file: 
        data = json.load(json_file)

        for l in data['canonical_ids']:
            try:
                output_dir = os.path.join(stage_dir, 'output')
                assigned_batch_key = (pathlib.Path(stage_dir).stem+'_batch')
                job_assigned_id = data['canonical_ids'][l]['job_assigned_id']
                pdb_path = os.path.join(output_dir, data['canonical_ids'][l][assigned_batch_key], f'{job_assigned_id}{output_db_ext}')
                data['canonical_ids'][l].update({'final_docking_pose':pdb_path})
            except KeyError:
                pass

    with open(dict_path, 'w', encoding='utf-8') as jsonf:
        jsonf.write(json.dumps(data, indent=4))
abazabaaa commented 3 years ago

Here is a more generally useful script that extracts all n runs from each of the dlg files in a directory:

import glob
import os
import pathlib

def extract_top_autodock_pose(target_dir, num_runs):

    file_ext = ".dlg"
    dlg_files_list = glob.glob(os.path.join(target_dir, "*"+file_ext))

    for l in dlg_files_list:

        print(f'opening dlg file {l}....')

        for target_run in range(1, num_runs+1):

            file_name = pathlib.Path(l)

            with open(file_name, "r") as dlg_file:
                lines = dlg_file.readlines()
            target_dir = str(file_name.parent)

            starting_line = 0
            ending_line = 0
            print(f'Extracting Run:   {target_run} / {num_runs}')
            for line in lines:
                if line.startswith(rf'Run:   {target_run} / {num_runs}'):
                    starting_line = lines.index(line)
                    break

            if target_run != num_runs:
                # not the last run: the next "Run:" header ends this pose block
                for line in lines[starting_line:]:
                    if line.startswith(rf'Run:   {(int(target_run) + 1)} / {num_runs}'):
                        ending_line = lines.index(line)
                        break
                pre_pdbqt_name_prefix = os.path.splitext(file_name.stem)
                pdbqt_name_prefix = pre_pdbqt_name_prefix[0] + str(target_run) + '.pdbqt'
                pdbqt_name = os.path.join(target_dir, pdbqt_name_prefix)
                pdbqt_content = lines[(starting_line + 4):(ending_line - 9)]
                stripped_pdbqt_content = [line.rstrip('\n') for line in pdbqt_content]

            if target_run == num_runs:
                # the last run: the clustering histogram ends this pose block
                for line in lines[starting_line:]:
                    if line.startswith('    CLUSTERING HISTOGRAM'):
                        ending_line = lines.index(line)
                        break
                pre_pdbqt_name_prefix = os.path.splitext(file_name.stem)
                pdbqt_name_prefix = pre_pdbqt_name_prefix[0] + str(target_run) + '.pdbqt'
                pdbqt_name = os.path.join(target_dir, pdbqt_name_prefix)
                pdbqt_content = lines[(starting_line + 4):(ending_line - 5)]
                stripped_pdbqt_content = [line.rstrip('\n') for line in pdbqt_content]

            clean_pdbqt_content = []

            for line in stripped_pdbqt_content:
                cleaned_line = line[max(line.find('D'), 8):]
                if not cleaned_line.startswith('USER'):
                    clean_pdbqt_content.append(cleaned_line)

            with open(pdbqt_name, 'w') as f:
                for line_item in clean_pdbqt_content:
                    f.write("%s\n" % line_item)

            print(f'Saved extracted run to {pdbqt_name}')
target_dir = 'path_to_dir_with_dlg_files'
# number of runs you expect to find in each dlg file
num_runs = 10
extract_top_autodock_pose(target_dir, num_runs)
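
To address the original question about summarizing and sorting batch results, the same XML attributes used above can produce a ranked summary. This is a minimal sketch; it assumes one AutoDock-GPU .xml file per ligand in target_dir and that the cluster_rank and mean_binding_energy attributes are present, as in the snippets above.

import csv
import glob
import os
import xml.etree.ElementTree as ET

def summarize_results(target_dir, out_csv='summary.csv'):
    rows = []
    for xml_path in glob.glob(os.path.join(target_dir, '*.xml')):
        root = ET.parse(xml_path).getroot()
        best = root.find(".//*[@cluster_rank='1']")        # top-ranked cluster
        if best is not None:
            rows.append((os.path.basename(xml_path),
                         float(best.get('mean_binding_energy')),
                         best.get('run')))
    rows.sort(key=lambda r: r[1])                           # most negative energy first
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['ligand', 'mean_binding_energy', 'best_run'])
        writer.writerows(rows)

summarize_results('path_to_dir_with_dlg_files')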
diogomart commented 3 years ago

Closing, feel free to reopen if needed. Thanks @pgoverna and @abazabaaa for posting scripts!

Thanh-An-Pham commented 2 years ago

I wrote a number of scripts to do these things. […]

Can you give me an example of input for run_dir, stage_dir, and output_db_ext, please?

abazabaaa commented 2 years ago

The run directory is the parent directory for the docking program I created. The directory structure looks like:

run directory
    -> stage directory
        --> input directory
            ---> batch directory
        --> scripts directory
            ---> batch directory
        --> output directory
            ---> batch directory

The output db extension in this case would be .pdb, since that is where the final poses are stored.
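
As a hypothetical example matching that layout (the paths below are placeholders), the three functions from the earlier comment would be called like this:

run_dir = '/scratch/campaign_01'              # parent run directory
stage_dir = '/scratch/campaign_01/stage_003'  # one stage directory under the run
output_db_ext = '.pdb'                        # final poses are stored as .pdb files

prep_docking_results(run_dir, stage_dir, output_db_ext)  # record scores and best runs from the XML files
extract_top_autodock_pose(run_dir, stage_dir)            # write the top pose from each DLG as a .pdbqt
# an external Open Babel PDBQT->PDB conversion is assumed to happen here
index_3d_pose(run_dir, stage_dir, output_db_ext)         # record the path of each converted pose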

I wrote all these scripts as I was learning Python for the first time -- self-taught -- so, looking back, I think I could have made the code more human-readable. I would be happy to share the repository and walk you through the entire pipeline via Zoom or something like that if you want.

It takes an SDF with up to 40M 3D structures (from Corina), splits them up, makes pdbqt files, docks them, extracts the top docking poses, and stores the results. (You could probably do 400M, but the slow step is extracting the top poses, since that only happens on one CPU.) There are also some incomplete, somewhat unpolished scripts that give a nice readout of the results.