Using an open-source program (Gypsum-DL) for preparing small-molecule libraries for structure-based virtual screening

TomkUCL / SARS-CoV-2-Helicase-nsp13-Public-Antivirals-Virtual-Screening-Project

A repository for public suggestions towards SARS-COV-2 helicase antivirals using publicly-available software.

Apache License 2.0

Using an open-source program (Gypsum-DL) for preparing small-molecule libraries for structure-based virtual screening #2

Open TomkUCL opened 9 months ago

TomkUCL commented 9 months ago

This repository focuses on the installation and use of Gypsum-DL 1.2.1, developed by the Durrant lab. Gypsum-DL is a free, open-source program for preparing 3D small-molecule models for molecular docking and virtual screening applications. Beyond simply assigning atomic coordinates, Gypsum-DL accounts for alternate ionization, tautomeric, chiral, cis/trans isomeric, and ring-conformational forms that are often ignored by other programs such as Open Babel.

It is released under the Apache License, Version 2.0 (see LICENSE.txt). Because it enumerates these additional forms, it offers an alternative to Open Babel that can improve docking accuracy.

The original repository can be found here: https://github.com/durrantlab/gypsum_dl

Please note: this repository is an application of the excellent work done by the Durrant group in developing Gypsum-DL. It illustrates how users who are not computer-savvy (i.e. chemists like myself) can apply this useful tool to their drug discovery projects!

I would encourage you to read the relevant publications for Gypsum-DL to understand its benefits, which can be found here:

Ropp, Patrick J., Jacob O. Spiegel, Jennifer L. Walker, Harrison Green, Guillermo A. Morales, Katherine A. Milliken, John J. Ringe, and Jacob D. Durrant. (2019) "Gypsum-DL: An Open-source Program for Preparing Small-molecule Libraries for Structure-based Virtual Screening." Journal of Cheminformatics 11:1. doi:10.1186/s13321-019-0358-3.

Ropp PJ, Kaminsky JC, Yablonski S, Durrant JD (2019) Dimorphite-DL: An open-source program for enumerating the ionization states of drug-like small molecules. J Cheminform 11:14. doi:10.1186/s13321-019-0336-9.

TomkUCL commented 9 months ago

Getting Started:

1) Install Ubuntu Desktop (Linux command line interface) onto your computer: https://ubuntu.com/download

2) Install Anaconda for Linux-x86: https://www.anaconda.com/download (direct installer link: https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh)

3) Make a new folder (directory) in your Ubuntu terminal, pressing Enter after each command, e.g.

cd /mnt
ls
cd d
mkdir gypsum_dl-1.2.1

4) Enter the new directory you have just created:

cd gypsum_dl-1.2.1

5) Save your ligands.sdf file into the current directory 'gypsum_dl-1.2.1'.

6) Install the necessary third-party libraries (rdkit, scipy, and numpy) needed to run Gypsum-DL:

conda install -c rdkit rdkit numpy scipy mpi4py

If rdkit does not install, try the install command from the RDKit project's own instructions:

conda create -c conda-forge -n my-rdkit-env rdkit

7) Download Gypsum-DL 1.2.1 from the GitHub repository (https://github.com/durrantlab/gypsum_dl) into this directory.

8) Create and activate a new conda environment using the Linux command line in Ubuntu:

conda create -c conda-forge --name gypsum_dl_env rdkit numpy scipy mpi4py -y

conda activate gypsum_dl_env

9) Run Gypsum-DL using your specified input file path:

python run_gypsum_dl.py --source ./examples/sample_molecules.smi

For example, my D drive contains the folder gypsum_dl-1.2.1, with a subfolder sdf_input_files holding the file 505_selection_of_comb_lib_1.sdf, so I would type the following into the Linux command line:

python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/sdf_input_files/505_selection_of_comb_lib_1.sdf

In my case I have also specified the output folder as '3D_output_files' and asked for each model to be written to its own file:

(gypsum_dl_env) tom@DESKTOP-LG9R7AE:/mnt/d/gypsum_dl-1.2.1$ python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/sdf_input_files/50_selection_of_comb_lib_1.smi --output /mnt/d/gypsum_dl-1.2.1/3D_output_files --separate_output_files
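Before launching a long run, it can help to confirm that the activated environment actually contains the libraries installed in the steps above. The following is a minimal sketch using only the Python standard library; the function name `check_packages` is hypothetical, and treating mpi4py as optional is an assumption of this sketch.

```python
import importlib.util

# Packages named in the install steps above. Treating mpi4py as optional
# is an assumption made for this sketch.
REQUIRED = ("rdkit", "numpy", "scipy")
OPTIONAL = ("mpi4py",)


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report each package without importing it.
    for name, found in check_packages(REQUIRED + OPTIONAL).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it with the gypsum_dl_env environment active; any package reported as MISSING should be installed before running run_gypsum_dl.py.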

TomkUCL commented 9 months ago

Saving your Combinatorial Library Products as SMILES (.smi) File for Gypsum-DL Processing:

1. First, create SMILES strings for your product structures using the Chemistry drop-down menu in DataWarrior.

2. Select the SMILES strings for the compounds you wish to use by selecting the column in DataWarrior (shift + left click).

3. Next, copy and paste the SMILES column into a new Excel spreadsheet and save it as a new .txt file.

4. Open the .txt file and choose 'Save as' with the file type 'All files', then replace the .txt extension in the file name with .smi.

5. Lastly, run Gypsum-DL using the following command:

python run_gypsum_dl.py --source YOUR_FILE_LOCATION

So in my case, this would be...

python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/sdf_input_files/50_selection_of_comb_lib_1.smi

..or if you wanted each model (e.g. tautomer) generated for each ligand to be stored as its own .sdf file in the output folder '3D_output_files' within the current directory, the command I would type would look like this:

(gypsum_dl_env) tom@DESKTOP-LG9R7AE:/mnt/d/gypsum_dl-1.2.1$ python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/sdf_input_files/50_selection_of_comb_lib_1.smi --output /mnt/d/gypsum_dl-1.2.1/3D_output_files --separate_output_files
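If you would rather skip the Excel and Notepad steps, the .smi file can also be written directly from Python. This is a minimal sketch, assuming the usual .smi layout of one SMILES string per line followed by a whitespace-separated name; the function `write_smi` and the `ligand_N` fallback names are hypothetical.

```python
from pathlib import Path


def write_smi(smiles_and_names, out_path):
    """Write one 'SMILES name' pair per line and return how many were written."""
    lines = []
    for i, (smiles, name) in enumerate(smiles_and_names, start=1):
        smiles = smiles.strip()
        if not smiles:
            continue  # skip empty cells copied over from the spreadsheet
        # Fall back to a numbered placeholder name if none was supplied,
        # and avoid spaces so the name stays a single token.
        name = (name or f"ligand_{i}").strip().replace(" ", "_")
        lines.append(f"{smiles} {name}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `write_smi([("CCO", "ethanol"), ("c1ccccc1", "benzene")], "my_library.smi")` would create a two-line file that can be passed to `--source`.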

TomkUCL commented 9 months ago

Running Gypsum:

After Gypsum-DL has finished, two new files will appear in the directory: a new .sdf file containing the 3D ligands that were processed successfully, and a .smi file containing the SMILES that failed. You can rename these and save them in the appropriate folder, e.g. save the 50 ligands in '3D_output_files'.

Note that this is quite a computationally expensive method compared to, say, Open Babel; the total run time for 50 SMILES strings was just over 30 minutes on my computer.

However, the resulting .sdf files describe each ligand more completely. For example, molecule 50 gives 5 output structures, each with hydrogens added and energy minimised, in ionisation states appropriate to a pH range around 7.4 (the default) and covering multiple tautomeric and diastereomeric forms. Vina does not generate these forms itself, yet they may predominate in solution or be stabilised through protein-ligand interactions. Consequently, docking the full set of forms with Vina should give more reliable results.

Unfortunately, this creates a slight problem. AutoDock Vina will only dock one model per .pdbqt file. So how do we create .pdbqt files for all of the tautomers/diastereomers we have now created? Don't worry, we'll cover this now.

Once you have prepared your 3D ligands using Gypsum-DL, go to the folder containing your new combined 3D ligands .sdf file in the Ubuntu terminal. Now you want to split the 50-ligand .sdf file into individual ligand .sdf files. You can do this using a simple Open Babel command:

obabel -isdf YOURLIGANDFILE.sdf -osdf -O *.sdf --split

e.g.

obabel -isdf gypsum_dl_success.sdf -osdf -O *.sdf --split

After this command, the combined ligand .sdf file will be empty and many new .sdf files will appear, one for each ligand from the Gypsum-DL calculation. You can delete the old, empty combined file since we no longer need it.
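If you would prefer to keep the combined file intact, the same split can be sketched in plain Python. This relies only on the fact that each record in an .sdf file ends with a line containing `$$$$`; it does not interpret the chemistry. The function name `split_sdf` and the `ligand_N.sdf` naming scheme are hypothetical choices for this sketch.

```python
from pathlib import Path


def split_sdf(sdf_path, out_dir, prefix="ligand"):
    """Write each '$$$$'-terminated record of sdf_path to its own .sdf file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    record = []
    with open(sdf_path, "r", encoding="utf-8") as handle:
        for line in handle:
            record.append(line)
            if line.strip() == "$$$$":  # end of one molecule record
                target = out_dir / f"{prefix}_{len(written) + 1}.sdf"
                target.write_text("".join(record), encoding="utf-8")
                written.append(target)
                record = []
    # Any trailing lines without a closing '$$$$' are ignored.
    return written
```

Calling `split_sdf("gypsum_dl_success.sdf", "split_ligands")` would write the numbered files into a separate folder and leave the input file unchanged.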

Unfortunately, AutoDock Vina does not recognise .sdf ligand files for docking. Furthermore, each ligand .sdf file still contains multiple 'models' (tautomers/enantiomers) of the same ligand generated by Gypsum-DL, so before we can run the virtual screen we still need to convert all of the .sdf files to .pdbqt files.

TomkUCL commented 9 months ago

Converting Your .sdf Output Files to .pdbqt for Virtual Screening

First, download and install OpenBabelGUI (https://openbabel.org/docs/Installation/install.html). Once you have done this, select your new .sdf ligand files as the input files, set the output format to pdbqt, and select 'Output below only' on the right-hand side. This will print the text of each MODEL within each .sdf file as a single block of .pdbqt text.

Next, copy all of the text into a new .txt file using Notepad, and name this '250_ligands.pdbqt' or whatever you want to call it.

Lastly, assuming you are in the folder/directory containing your 250_ligands.pdbqt file within your Ubuntu terminal, type in the following command and press enter:

awk '/^MODEL/{if (out) close(out); n++; out = "output_dir/output_prefix" n ".pdbqt"} out {print > out}' "input_dir/ligands.pdbqt"

But remember to first replace "input_dir" with the path to your input directory, "output_dir" with the path to your output directory, and "output_prefix" with your desired prefix for the output files.
For example, if my input file is /mnt/d/gypsum_dl-1.2.1/3D_output_files/250_ligands.pdbqt, my output directory is /mnt/d/gypsum_dl-1.2.1/3D_output_files/ (i.e. the same directory), and I want the output files to be named 'model', the command would look like this:

awk '/^MODEL/{if (out) close(out); n++; out = "/mnt/d/gypsum_dl-1.2.1/3D_output_files/model" n ".pdbqt"} out {print > out}' "/mnt/d/gypsum_dl-1.2.1/3D_output_files/250_ligands.pdbqt"

This command reads your .pdbqt file and starts a new output file each time it reaches a line beginning with 'MODEL', closing the previous file first. The result is a separate .pdbqt file for each enantiomer or tautomer (model) of each ligand generated by Gypsum-DL.

Now if you open the directory, you should see an individual .pdbqt file for each tautomer/enantiomer of each ligand. These ligands are now almost ready to be docked!
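For anyone more comfortable with Python than awk, the same MODEL-based split can be sketched as below. It starts a new file at every line beginning with 'MODEL' and keeps that line in the output. The function name `split_pdbqt_models` and its default `model` prefix are hypothetical choices for this sketch.

```python
from pathlib import Path


def split_pdbqt_models(pdbqt_path, out_dir, prefix="model"):
    """Split a multi-MODEL .pdbqt file into one numbered file per MODEL block."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    current = None  # handle of the output file currently being filled
    try:
        with open(pdbqt_path, "r") as handle:
            for line in handle:
                if line.startswith("MODEL"):
                    # A new model begins: close the previous file, open the next.
                    if current is not None:
                        current.close()
                    target = out_dir / f"{prefix}{len(written) + 1}.pdbqt"
                    current = open(target, "w")
                    written.append(target)
                # Lines before the first MODEL line are skipped.
                if current is not None:
                    current.write(line)
    finally:
        if current is not None:
            current.close()
    return written
```

Calling `split_pdbqt_models("250_ligands.pdbqt", ".")` from the folder containing the combined file would produce model1.pdbqt, model2.pdbqt, and so on.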

TomkUCL commented 9 months ago

Lastly, we need to remove the 'MODEL' line in each .pdbqt file, otherwise Vina will read each file as containing multiple models and will be unable to dock these. We can do this using a python script:

import os

def delete_model_lines(file_path):
    # Read the content of the file
    with open(file_path, 'r') as file:
        lines = file.readlines()

    # Drop every line that starts with 'MODEL'
    # (each split file is expected to contain exactly one)
    lines = [line for line in lines if not line.startswith('MODEL')]

    # Write the modified content back to the file
    with open(file_path, 'w') as file:
        file.writelines(lines)

def process_files_in_directory():
    # Get the current directory
    directory = os.getcwd()
    # Iterate through each .pdbqt file in the directory
    for filename in os.listdir(directory):
        if filename.endswith('.pdbqt'):
            file_path = os.path.join(directory, filename)
            delete_model_lines(file_path)

if __name__ == "__main__":
    process_files_in_directory()
    print("Processing complete.")

Save this script as delete_model_line_pdbqt.py in the directory containing your .pdbqt files. Then, open a terminal, navigate to the directory containing both the script and the .pdbqt files, and execute the script:

python3 delete_model_line_pdbqt.py

This will process each .pdbqt file in the current directory, deleting every line that starts with 'MODEL' from each file. After processing, it will print "Processing complete."
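As a quick sanity check before docking, you could confirm that no 'MODEL' lines are left behind. The sketch below only reads the files and changes nothing; the function name `files_with_model_lines` is hypothetical.

```python
from pathlib import Path


def files_with_model_lines(directory="."):
    """Return the names of .pdbqt files that still contain a 'MODEL' line."""
    flagged = []
    for path in sorted(Path(directory).glob("*.pdbqt")):
        with open(path, "r") as handle:
            if any(line.startswith("MODEL") for line in handle):
                flagged.append(path.name)
    return flagged


if __name__ == "__main__":
    # Read-only report for the current directory.
    leftover = files_with_model_lines()
    print(f"{len(leftover)} file(s) still contain a MODEL line: {leftover}")
```

An empty list means every .pdbqt file in the folder has been cleaned.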

Now that we have our ligands as individual .pdbqt files, we can move on to the docking process using AutoDock Vina in Issue #1.

TomkUCL commented 7 months ago

(base) tom@DESKTOP-LG9R7AE:~$ conda activate gypsum_dl_env
(gypsum_dl_env) tom@DESKTOP-LG9R7AE:~$ python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/input_files/Enamine_Aryl_halides_SNAr/1-1000_5rmm_combinatorial_library_lipinski_filtered.smi --output /mnt/d/gypsum_dl-1.2.1/3D_output_files/Enamine_Aryl_halides_SNAr --separate_output_files

TomkUCL commented 7 months ago

1) Activate the Anaconda environment for Gypsum-DL

conda activate gypsum_dl_env

2) Go to Gypsum directory/folder in WSL

cd /mnt/d/gypsum_dl-1.2.1

3) Specify input and output folders and run

python run_gypsum_dl.py --source /mnt/d/gypsum_dl-1.2.1/input_files/Enamine_Aryl_halides_SNAr/1-1000_5rmm_combinatorial_library_lipinski_filtered.smi --output /mnt/d/gypsum_dl-1.2.1/3D_output_files/Enamine_Aryl_halides_SNAr --separate_output_files
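When the same run has to be repeated for several libraries, the three steps above can be wrapped in a small Python helper. This is only a sketch: it uses the `--source`, `--output` and `--separate_output_files` options shown in the commands above, and the helper names `build_gypsum_command` and `run_gypsum` are hypothetical. It assumes the script is started with the gypsum_dl_env environment already active, so that `sys.executable` points at that environment's Python.

```python
import subprocess
import sys
from pathlib import Path


def build_gypsum_command(source, output_dir, separate_output_files=True,
                         script="run_gypsum_dl.py"):
    """Assemble the argument list for one Gypsum-DL run."""
    cmd = [sys.executable, script,
           "--source", str(source),
           "--output", str(output_dir)]
    if separate_output_files:
        cmd.append("--separate_output_files")
    return cmd


def run_gypsum(source, output_dir, gypsum_dir):
    """Run Gypsum-DL from inside gypsum_dir, creating output_dir if needed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Gypsum-DL exits with an error.
    return subprocess.run(build_gypsum_command(source, output_dir),
                          cwd=gypsum_dir, check=True)
```

Passing the arguments as a list, rather than one long string, avoids problems with spaces or line breaks creeping into long file paths.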