CD mol2/pdb/sd files coordinates starting in the binding pocket

andrrizzi commented 7 years ago

Fix #41.

I ended up implementing the file editing part myself because I was having trouble finding a library that supports file extensions and/or doesn't change atom names.

This is the script I used in case you want to review the exact procedure. I'm not sure if it can be useful to add it to a scripts/ folder in the repo.

#!/usr/bin/env python

import re
import os
import glob

import numpy
import mdtraj

def replace_positions(file_path, positions_in_angstroms):
    positions = positions_in_angstroms  # Shortcut

    # Load existing file.
    with open(file_path, 'r') as f:
        lines = f.readlines()

    pos_idx = 0
    found_atom_directive = False  # Used only for mol2 files.
    extension = os.path.splitext(file_path)[1][1:]

    # Modify positions.
    with open(file_path, 'w') as f:
        for line_idx, line in enumerate(lines):

            # PDB files
            if extension == 'pdb':
                if line[:4] == 'ATOM':
                    coordinates_str = ('{:8.3f}' * 3).format(*positions[pos_idx])
                    line = line[:30] + coordinates_str + line[54:]
                    pos_idx += 1

            # Mol2 files
            elif extension == 'mol2':
                if '@<TRIPOS>' in line:
                    if '@<TRIPOS>ATOM' in line:
                        found_atom_directive = True
                    else:
                        found_atom_directive = False
                elif found_atom_directive:
                    # Find all words separated by whitespaces and their indices.
                    words = [(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', line)]
                    # The coordinates are the 3rd, 4th and 5th columns. Find number of
                    # character for each coordinate to maintain column alignments.
                    coordinates_columns_widths = [words[i][1][1] - words[i-1][1][1] for i in range(2, 5)]
                    coordinates_str = '{{:{}.4f}}{{:{}.4f}}{{:{}.4f}}'.format(*coordinates_columns_widths)
                    coordinates_str = coordinates_str.format(*positions[pos_idx])
                    line = line[:words[1][1][1]+1] + coordinates_str + line[words[5][1][0]-1:]
                    pos_idx += 1

            # SDF files
            elif extension == 'sd' or extension == 'sdf':
                if line_idx == 3:
                    n_atoms = int(line[:3])
                elif line_idx > 3 and pos_idx < n_atoms:
                    coordinates_str = ('{:10.4f}' * 3).format(*positions[pos_idx])
                    line = coordinates_str + line[30:]
                    pos_idx += 1

            else:
                raise RuntimeError('Unrecognized extension {}'.format(extension))

            # Write line with (eventually) replaced positions.
            f.write(line)
    # We should have replaced all positions.
    assert pos_idx == len(positions)

if __name__ == '__main__':
    data_dirs = os.path.join('input_files', 'cd-set*')

    for data_dir in glob.glob(data_dirs):
        pdb_dir = os.path.join(data_dir, 'pdb')
        amber_dir = os.path.join(data_dir, 'prmtop-rst7')

        old_receptor_file_path = ''

        # Loop over all system files.
        prmtop_file_paths = os.path.join(amber_dir, '*cd-*-s.prmtop')
        for prmtop_file_path in glob.glob(prmtop_file_paths):
            system_file_name = os.path.splitext(os.path.basename(prmtop_file_path))[0]
            print('Processing {}'.format(system_file_name))

            # Identify file paths to system positions, single host and single guest.
            rst_file_path = os.path.join(amber_dir, system_file_name + '.rst7')
            receptor_file_path = os.path.join(pdb_dir, 'host-' + system_file_name[:3] + '.pdb')
            ligand_id = system_file_name.split('-')[1]
            ligand_file_path = os.path.join(pdb_dir, 'guest-' + ligand_id + '.pdb')

            # Load receptor single molecule. The host won't change for the same cd set.
            if old_receptor_file_path != receptor_file_path:
                receptor_traj = mdtraj.load(receptor_file_path)

            # Load ligand single molecule.
            ligand_traj = mdtraj.load(ligand_file_path)

            # Load amber system and remove solvent/ions.
            system_topology = mdtraj.load_prmtop(prmtop_file_path)
            system_traj = mdtraj.load_restrt(rst_file_path, top=system_topology)
            system_traj.remove_solvent(inplace=True)
            system_topology = system_traj.topology

            # Create a Trajectory with both receptor and ligand.
            joined_topology = receptor_traj.topology.join(ligand_traj.topology)
            joined_coordinates = numpy.concatenate((receptor_traj.xyz[0], ligand_traj.xyz[0]), axis=0)
            joined_traj = mdtraj.Trajectory(joined_coordinates, joined_topology)

            # Make sure that we are setting the correct positions to the correct atoms.
            assert system_topology.n_atoms == joined_topology.n_atoms
            for a1, a2 in zip(system_topology.atoms, joined_topology.atoms):
                assert a1.name == a2.name

            # First align system receptor to single receptor.
            receptor_atom_indices = list(range(receptor_traj.n_atoms))
            system_traj.superpose(joined_traj, atom_indices=receptor_atom_indices, ref_atom_indices=receptor_atom_indices)

            # Then we align the single ligand to the system ligand.
            ligand_atom_indices = list(range(receptor_traj.n_atoms, joined_traj.n_atoms))
            joined_traj.superpose(system_traj, atom_indices=ligand_atom_indices, ref_atom_indices=ligand_atom_indices)

            # Replace positions in old files.
            positions_in_angstroms = joined_traj.xyz[0][ligand_atom_indices] * 10
            for extension in ['pdb', 'mol2', 'sd']:
                guest_file_path = os.path.join(data_dir, extension, 'guest-' + ligand_id + '.' + extension)
                assert os.path.exists(guest_file_path)
                replace_positions(guest_file_path, positions_in_angstroms)

davidlmobley commented 7 years ago

So, just so I understand and for the record -- for some reason the PDB/mol2 files for the guests had the guest coordinates outside the binding site, whereas only the coordinate files had the guests in the binding site, so this would resolve the issue and make the coordinates for all files be consistent?

@nhenriksen can you review?

nhenriksen commented 7 years ago

@davidlmobley the PDB/mol2 files for the separate host and guests were originally intended to be stand alone, and required the user to generate bound conformations. The prmtop/rst7 offered a plausible, solvated bound pose. But I can see how this might not fit into some workflows, where the user might want a gas phase bound structure.

@andrrizzi You mentioned using an alignment step to match up the all the bound states, but there is still only one host PDB/mol2 in the directory, correct? Did you do any checks to make sure there aren't residual steric overlaps when you load a given guest structure together with that particular host conformation?

andrrizzi commented 7 years ago

make the coordinates for all files be consistent?

Almost. The positions between rst7 complex files and PDB/mol2/sd single-molecule files can't be the same because the rst7 files are taken after equilibration so the host positions are all different. Here I've changed the pdb/mol2 guests positions to be placed in the binding site of the pdb/mol2 host with a binding pose very similar to the one in rst7 files.

there is still only host PDB/mol2 in the directory, correct? Did you do any checks to make sure there aren't residual steric overlaps?

Correct. I checked for steric clashes by eye in PyMol and everything looked fine to me. I'll make my simulations start from these positions anyway, so I'll discover soon if there is anything problematic.

MobleyLab / benchmarksets

CD mol2/pdb/sd files coordinates starting in the binding pocket #44