ccsb-scripps / AutoDock-Vina

AutoDock Vina
http://vina.scripps.edu
Apache License 2.0
561 stars 199 forks source link

Virtual screening using Python bindings #320

Closed eunos-1128 closed 2 weeks ago

eunos-1128 commented 3 weeks ago

Hi,

I'm trying to virtual screening with vina's batch mode using Python API.

According to docs, set_ligand_from_file looks to take multiple ligand PDBQT files as arguments.

I think PDBQT file can have multiple models in it.

Which is appropriate for vina's batch mode, one PDBQT file with multiple models or multiple PDBQT files where each one has only one model? (I've never tried and compared both of two ways. )

diogomart commented 3 weeks ago

Hi Vina expects multiple PDBQT files where each one has only one model.

eunos-1128 commented 3 weeks ago

Thank you for answering.

eunos-1128 commented 2 weeks ago

Using Python bindings, what is the difference between normal docking and batch mode?

The only difference seems to be the ability to dock without initialising a new Vina instance each time.

Could you please tell me if there are any advantages that are only available in batch mode? (I couldn't tell)

diogomart commented 2 weeks ago

That is the only advantage, which makes it faster.

eunos-1128 commented 2 weeks ago

When using set_ligand_from_file, the program would sometimes crash due to memory overflow because there were too many files in the virtual screening.

When using Python bindings, if there are too many files, I will try to perform normal docking instead of using batch mode.

diogomart commented 2 weeks ago

That sounds like a bug. What was the approximate number of ligands that made it crash? More like thousands, or more like millions?

rwxayheee commented 2 weeks ago

Hi @eunos-1128 Could you provide a bit more details of your hardware, OS and where do you run from (Jupyter notebook has a memory limit, while a python script shouldn't)?

eunos-1128 commented 2 weeks ago

@diogomart

That sounds like a bug. What was the approximate number of ligands that made it crash? More like thousands, or more like millions?

About a million ligands per a docking.

@rwxayheee

I execute the program using Jupyter notebook on a linux workstation, which I didn't know that would be the limitation.

rwxayheee commented 2 weeks ago

Hi @eunos-1128 Jupyter notebook has a default memory limit (you could use a search engine to read more on it). It might be helpful to try running the same codes outside of Jupyter notebook. You can copy all codes to a python script, then use your environment python to run it. Otherwise, try increasing Jupyter notebook's memory limit might help, too, but I'm not entirely sure how much it needs to dock a million ligands.

I have never done a very thorough memory monitoring on each steps, so I'm not very sure how memory usage is affected by the way we run it. For a very long time, I was not aware of the batch mode and just run one Vina command for each ligand... Ideally, memory usage shouldn't scale with number of ligands in the batch mode (?) I can take a closer look when I get a chance

eunos-1128 commented 2 weeks ago

@rwxayheee

Thanks for the advice.

I checked and it seems that Jupyter notebook can relax the memory limit, which I'm trying now.

If that doesn't work, I will move the code to a python file and run it.

eunos-1128 commented 2 weeks ago

@diogomart @rwxayheee

v.set_ligand_from_file('1iep_ligand.pdbqt')
v.compute_vina_maps(center=[15.190, 53.903, 16.917], box_size=[20, 20, 20])

The next lines are used to first load a PDBQT file containing the ligand called 1iep_ligand.pdbqt and then compute the affinity maps for each ligand atom types accroding to the Vina forcefield. You might need to read first the tutorial Basic docking to learn how to create a PDBQT file of a ligand. There is a small subility here, the behavior of the compute_vina_maps() function changes if the ligand was loaded before or after computing the vina maps. If no ligand was initialized, compute_vina_maps() will compute the affinity map for each atom types defined in the Vina forcefield (22 in total). This is very useful when we want to dock ligands in batch (a.k.a virtual screening) but we don’t necessarily know beforehand what atom types will be necessary for thoses ligands. Alternately to set_ligand_from_file(), you could also load a molecule using a molecule string in PDBQT format using the set_ligand_from_string() function.

I would like to confirm something about this part of the documentation.

Does this mean that I should run set_ligand_from_file after running compute_vina_maps if I want to dock multiple ligands at once, for example for virtual screening?

Does the order in which these functions are executed change the docking results (i.e. docking score and ligand RMSD) significantly?

rwxayheee commented 2 weeks ago

Hi @eunos-1128 Running compute_vina_maps before setting the first ligand can save time. The documentation recommends this, such that the maps are computed only once, but they are good for all ligands. No more redundant steps to compute the corresponding maps for ligands of different compositions. Doing set_ligand_from_file before or after set_ligand_from_file shouldn't change the results of the docking. However, if it does, please do let us know.

eunos-1128 commented 2 weeks ago

@rwxayheee

Thank you!

I'll try both of two ways to compare results if I have time enough to do.

eunos-1128 commented 2 weeks ago

I used the write_poses method to write out the docking results after virtual screening using Python bindings, but there was no output.

When I set one ligand PDBQT at a time, not in batch mode, the output files were written out without any problem, but when I set multiple PDBQT files in batch mode, nothing was output.

I can't think of a cause for this, but do you have any ideas?

rwxayheee commented 2 weeks ago

Hi @eunos-1128 Could you show an example of your codes for batch docking? Doesn't need specific ligand or receptor file, but I wish to know the order of commands (initialize vina, set receptor, compute maps, set ligand, write poses) and how they are arranged in loop. It would be tremendously helpful if you could provide that information. I will try to reproduce with my files

rwxayheee commented 2 weeks ago

one more thing, in your first post

According to docs, set_ligand_from_file looks to take multiple ligand PDBQT files as arguments.

This might not exactly be the batch mode you're looking for virtual screening. Doing something like the following:

v.set_ligand_from_file(['Ligand_1.pdbqt', 'Ligand_2.pdbqt'])

Will generate poses each containing two ligands (could be another cause for the unexpected memory surge). The multiple ligand PDBQT files are taken as arguments for docking multiple ligands simultaneously. If you want to perform docking one at a time, you need to read them in one by one, use write_poses after each dock within the loop.

eunos-1128 commented 2 weeks ago

@rwxayheee

Hi, I use the function below for virtual screening.

def dock_ligands_to_protein(prot_name):
    ligand_files = glob.glob(f'{PARENT_DIR}/actives/{prot_name}/active_[1-2].pdbqt')
    ligand_files += glob.glob(f'{PARENT_DIR}/decoys/{prot_name}/decoy_[1-2].pdbqt')
    result_file = f'{PARENT_DIR}/results/vs/all/ligands_to_{prot_name}.pdbqt'

    if os.path.exists(result_file):
        return

    try:
        print(f"Starting docking {prot_name} and compounds(active/decoy)...")

        vina = Vina(sf_name="vina", seed=42, verbosity=1)

        vina.set_receptor(
            f"{PARENT_DIR}/receptors/{prot_name}_elem_charge_fixed-FHs.pdbqt"
        )

        pml.load(
            os.path.expanduser(f"~/eunos/dockings/datasets/DUD-E/all/{prot_name}/crystal_ligand.mol2")
        )
        centroid_coords = pml.centerofmass('crystal_ligand')
        pml.delete('all')

        vina.compute_vina_maps(center=centroid_coords, box_size=BOX_SIZE)
        vina.set_ligand_from_file(ligand_files)

        pp.pprint(vina.info())

        vina.randomize()
        vina.dock(exhaustiveness=32, n_poses=20)

        print(f"Finishing docking {prot_name} and compounds(active/decoy)!")

        vina.write_poses(
            result_file,
            energy_range=10000.0,
            overwrite=True,
            n_poses=20
        )

    except Exception as e:
        print(f"Exception occurred for {prot_name}: {e}")

    finally:
        pml.delete("all")
        gc.collect()
rwxayheee commented 2 weeks ago

Hi @eunos-1128 Thanks for posting the commands. I think you might be performing multiple-ligand docking (trying to fit many ligands into one structure), not individual screening. If you want to perform screening, for individual ligands, you could do:

def dock_ligands_to_protein(prot_name):
    ligand_files = glob.glob(f'{PARENT_DIR}/actives/{prot_name}/active_[1-2].pdbqt')
    ligand_files += glob.glob(f'{PARENT_DIR}/decoys/{prot_name}/decoy_[1-2].pdbqt')

    if os.path.exists(result_file):
        return

    try:
        print(f"Starting docking {prot_name} and compounds(active/decoy)...")

        vina = Vina(sf_name="vina", seed=42, verbosity=1)

        vina.set_receptor(
            f"{PARENT_DIR}/receptors/{prot_name}_elem_charge_fixed-FHs.pdbqt"
        )

        pml.load(
            os.path.expanduser(f"~/eunos/dockings/datasets/DUD-E/all/{prot_name}/crystal_ligand.mol2")
        )
        centroid_coords = pml.centerofmass('crystal_ligand')
        pml.delete('all')

        vina.compute_vina_maps(center=centroid_coords, box_size=BOX_SIZE)

        for ligand_file in ligand_files:
            ligand_name = os.path.basename(ligand_file).replace('.pdbqt','')
            vina.set_ligand_from_file(ligand_file)

            vina.randomize()
            vina.dock(exhaustiveness=32, n_poses=20)

            print(f"Finishing docking {prot_name} and compound: {ligand_file}!")

            result_file = f'{PARENT_DIR}/results/vs/all/{ligand_name}_to_{prot_name}.pdbqt'

            vina.write_poses(
                result_file,
                energy_range=10000.0,
                overwrite=True,
                n_poses=20
            )

    except Exception as e:
        print(f"Exception occurred for {prot_name}: {e}")

    finally:
        pml.delete("all")
        gc.collect()

i didn't test the codes. But i hope this could explain what I meant by individual virtual screening. Also, the input PDBQT files need to have exactly one ligand each. I don't think multiple-ligand PDBQT inputs are currently supported

eunos-1128 commented 2 weeks ago

@rwxayheee

Thank you for reviewing the code.

I checked set_ligand_from_file and found it seems to deal with batch ligands. https://github.com/ccsb-scripps/AutoDock-Vina/blob/v1.2.5/build/python/vina/vina.py#L164-L186

You meant those ligands(pdbqt_filename whose type is list or tuple of PDBQT file paths) are intended to execute multiple-ligands docking?

Also, the input PDBQT files need to have exactly one ligand each. I don't think multiple-ligand PDBQT inputs are currently supported

I understand this point.

rwxayheee commented 2 weeks ago

Hi @eunos-1128

You meant those ligands are intended to execute multiple-ligands docking?

Yes, in my understanding it's for docking multiple ligands simultaneously. You can do a quick check by giving two ligands, and check out the output - each "pose" will contain two ligands.

I think initially you were docking multiple ligands simultaneously (trying to generate a co-complex of receptor and many, many ligands) which led to the unexpected memory surge

eunos-1128 commented 2 weeks ago

@rwxayheee

I think initially you were docking multiple ligands simultaneously (trying to generate a co-complex of receptor and many, many ligands) which led to the unexpected memory surge

I see...

I'm trying the way you proposed. Thank you.

eunos-1128 commented 2 weeks ago

@rwxayheee

It looks working for the time being.

Thanks a lot!