Questions about using reduce2 as a standalone program

rwxayheee commented 1 week ago

Hi all,

Thank you for the excellent work. I have been using the original reduce to add hydrogens to protein structures, and I am eager to learn the new version. I have two quick questions about reduce2:

1. Handling of Ions

Is there a simple way to make an exclude list, to suppress the attempts to add hydrogens to ions and specific residues by names?
When ions (NA, CL, etc.) are encountered, the following message will be displayed and the program will exit. Sorry: Restraints were not found for the following residues: NA Is there an option to leave them alone, just like waters (HOH) are left unprotonated?

2. Unexpected Translation of Coordinates

Reduce2 moves (or re-centers?) a monomeric protein after hydrogen addition. This was not the default behavior of the original reduce. Here's an example:

Input PDB 1iep_chainA.pdb.txt This is the chain A of protein extracted from PDB 1IEP. Original structure is a dimer. The input only has the atom coordinates (therefore it's missing certain specifications like CRYST1, SCALE, ...)

The output with original reduce 1iep_chainAH.original.pdb.txt

The output with reduce2 1iep_chainAH.reduce2.pdb.txt

Is there a way to keep the original position? The translation won't happen when the original CRYST1, SCALE cards are present in the input file. If CRYST1, SCALE cards are required but not present in PDB structures that have been manipulated (manually edited, combined, generated by other programs, etc.), how to generate the proper cards to avoid translation of coordinates?

Thank you for your time and kind advice in advance!

russell-taylor commented 1 week ago

Reduce2 is still having problems adding restraints to some input files and Phenix staff are in the process of repairing these. If you have a failing file that you can share, please add it to this issue and I'll see if I can tell what is going on or push it to the top of the list.
The CCTBX interpretation fails without a crystal structure. To put one in place, reduce2 calls cctbx.maptbx.box.shift_and_box_model(), which apparently shifts the coordinates by default. I've modified the code to tell it not to shift the models and pushed that to a branch, which I'll merge with the main branch after testing so the fix should show up in the next release. You can use the reduce2-no-shift branch until that is merged, and the main branch once it is.

rwxayheee commented 1 week ago

Hi @russell-taylor

Thanks so much for your prompt reply. This is very helpful. The example structure I was having issue with ion handling is PDB 1IEP: https://www.rcsb.org/structure/1IEP It contains a chloride ion.

The other one was PDB 3KGD: https://www.rcsb.org/structure/3kgd It contains a sodium ion.

I have changed my workflow, such that the models were desalted first. I wanted to do it with iotbx but I'm very new to it. Could you please take a quick look at the codes?

from pathlib import Path
from iotbx.data_manager import DataManager

dm = DataManager()
filename_pdb = "1iep.pdb"
dm.process_model_file(filename_pdb)

model = dm.get_model(filename=filename_pdb)
hierarchy = model.get_hierarchy()
print(hierarchy.composition())

will give

group_args
  n_atoms                        : 4710
  n_chains                       : 6
  n_hd                           : 0
  n_nucleotide                   : 0
  n_nucleotide_atoms             : 0
  n_other                        : 8
  n_other_atoms                  : 80
  n_protein                      : 548
  n_protein_atoms                : 4458
  n_water                        : 172
  n_water_atoms                  : 172
  other_cnts                     : Counter({' CL': 6, 'STI': 2})

Then:

# Manipulate the Model: Desalt and Keep Chain A by Default
def cleanup_model(model_in, keep_chains = ["A"], exclude_residues = [" NA", " CL"]):
    model_out = model_in.deep_copy()

    for m in model_out.get_hierarchy().models():
        for chain in m.chains():
            if chain.id in keep_chains:
                for rg in chain.residue_groups():
                    for ag in rg.atom_groups():
                        if ag.resname in exclude_residues:
                            ag.parent().remove_atom_group(ag)
            else:
                chain.parent().remove_chain(chain)

    return model_out

model_desalt = cleanup_model(model)
hierarchy = model_desalt.get_hierarchy()
print(hierarchy.composition())

This prints:

group_args
  n_atoms                        : 2365
  n_chains                       : 3
  n_hd                           : 0
  n_nucleotide                   : 0
  n_nucleotide_atoms             : 0
  n_other                        : 1
  n_other_atoms                  : 37
  n_protein                      : 274
  n_protein_atoms                : 2229
  n_water                        : 99
  n_water_atoms                  : 99
  other_cnts                     : Counter({'STI': 1})

But this:

current_path = Path().resolve()
output_filename = f"{current_path}/1iep_chainA_complex.pdb"
dm.write_model_file(model_desalt, filename=output_filename, format="pdb")

Gives Segmentation fault. Could you please tell me what I'm doing wrong?

Writing an unmodified model

dm.write_model_file(model, filename=output_filename, format="pdb")

worked fine for me

I have a very clumsy way of manipulating the model (always do remove_atom) and the edited model can be written by dm.write_model_file:


def cleanup_model(model_in, keep_chains = ["A"], exclude_residues = [" NA", " CL"]):
    model_out = model_in.deep_copy()

    for m in model_out.get_hierarchy().models():
        for chain in m.chains():
            if chain.id in keep_chains:
                for rg in chain.residue_groups():
                    for ag in rg.atom_groups():
                        if ag.resname in exclude_residues:
                            for atom in ag.atoms():
                                ag.remove_atom(atom=atom)
            else:
                for rg in chain.residue_groups():
                    for ag in rg.atom_groups():
                        for atom in ag.atoms():
                            ag.remove_atom(atom=atom)

    return model_out

russell-taylor commented 1 week ago

When I run 1iep.cif and 3kgd.cif using a Phenix build of CCTBX, the files run to completion. My suspicion is that the chem_data folders are not being properly linked in when doing a non-Phenix build. I'll build the system that way and see if I can repair the problem.

I'm afraid that I won't be much help debugging iotbx workflows. I've spent almost all my time in the C++/Python linking world while developing Reduce2 and Probe2 and am not familiar with other parts of the system. Hopefully someone else will chime in.

russell-taylor commented 1 week ago

Okay, I've modified the bootstrap.py script in the master branch and built as follows: python bootstrap.py --use-conda --python=310 --no-boost-src --builder=molprobity

Then after running the setup script I can run mmtbx.reduce2 add_flip_movers=True output.overwrite=True 1iep.cif and it works.

(On Windows, you need to help it check out geostd by using a command prompt that knows about git and going into modules/chem_data/geostd and then doing: git config core.protectNTFS false followed by git checkout first.)

If this gets things working for you, go ahead and close the issue. Otherwise, let me know what went wrong.

olegsobolev commented 1 week ago

@rwxayheee , the cleanup you are trying to do is easier done with selections: sel = model_in.selection("chain A and not (resname NA or resname CL)") model_out = model_in.select(sel)

Docs on this: https://phenix-online.org/documentation/reference/atom_selections.html https://youtu.be/K28q378F39s?si=LFrSIJQ5n5mGaEdD

dcliebschner commented 1 week ago

As an alternative, you can also use the command line tool phenix.pdbtools (https://phenix-online.org/version_docs/dev-5239/reference/pdbtools.html) for selecting parts of a model.

rwxayheee commented 1 week ago

Hi @olegsobolev and @dcliebschner Thanks so much for the recommendations, I will check it out!

Hi @russell-taylor ! Thank you, that makes so much sense. Could you please take a quick look and let me know what's the correct way to set environment variable MMTBX_CCP4_MONOMER_LIB?

I very naively downloaded it from the geostd repository (https://github.com/phenix-project/geostd.git). The recent changes made me think it's up to date, but maybe I shouldn't download it this way. Here's a Colab notebook: https://colab.research.google.com/drive/1sXoI3-4jqzk4YG6CkYMQ3OWhIaBp13zm#scrollTo=Wgz1OwfCGbB3

In the Colab notebook, I downloaded geostd by:

# Download CCP4 Monomer (restraint) Library
monlib_repo = "https://github.com/phenix-project/geostd.git"
! git clone {monlib_repo}

Then to run reduce2, I did:

! export MMTBX_CCP4_MONOMER_LIB="{geostd_path}"; python {reduce2_path} {input_pdb} {reduce_opts}

I will try the way you suggested, and let you know how it goes. Thanks again for your kind suggestion

russell-taylor commented 6 days ago

The mechanisms that the CCTBX setup script uses to configure various environment variables are a mystery to me. I do know that in the past I had to set both the MMTBX_CCP4_MONOMER_LIB (at geostd) and the CLIBD_MON (at mon_lib) environment variables to make things work.

I don't have a deep understanding of the process, so I pretty much just follow my same recipe each time to get CCTBX environment:

wget https://raw.githubusercontent.com/cctbx/cctbx_project/master/libtbx/auto_build/bootstrap.py
Use the "x64 Native Tools Command Prompt" for Visual Studio, not the "Developer Studio 2022" prompt.
python bootstrap.py --builder=molprobity --use-conda --python=310 --nproc=1 --no-boost-src (maybe replacing nproc=1 with nproc=16 for a faster build).
Fix up the geostd checkout using git config core.protectNTFS false and then git checkout in the modules/chem_data/geostd directory.
In a command window, go to the build directory and source the appropriate setup script (setpaths.bat for Windows).

Then I can run mmtbx.reduce2 in the command window.

rwxayheee commented 5 days ago

Hello @russell-taylor

Thanks so much for your time and kind reply. My initial objective was to gather minimal data and dependencies to run reduce2 as a standalone, but I understand that the only proper way is to install the full assets. I will close the issue. Thanks everyone for your kind replies.

If you're interested, I have a little bit more information on the original issue:

I do know that in the past I had to set both the MMTBX_CCP4_MONOMER_LIB (at geostd) and the CLIBD_MON (at mon_lib) environment variables to make things work.

Yes, I learned this from some older tutorials of the cctbx project. I thought the two libraries were the only ones needed. But later I realized that for NA, CL (and maybe other monoatomic ions?), the chemical component library is necessary. I was able to pass by pointing the path in mmtbx/chemical_components/init.py to the path where I have a standalone chemical_components.

Starting /usr/local/lib/python3.10/site-packages/mmtbx/command_line/reduce2.py
on Thu Sep 12 21:47:31 2024 by root
===============================================================================

Processing files:
-------------------------------------------------------------------------------

  Found model, 1iep_receptor_reduce_input.pdb

Processing PHIL parameters:
-------------------------------------------------------------------------------

  Adding command-line PHIL:
  -------------------------
    approach=add
    add_flip_movers=True

Final processed PHIL parameters:
-------------------------------------------------------------------------------
  data_manager {
    model {
      file = "1iep_receptor_reduce_input.pdb"
    }
    default_model = "1iep_receptor_reduce_input.pdb"
  }
  add_flip_movers = True

Starting job
===============================================================================
Writing model output to 1iep_receptor_reduce_inputFH.pdb

                       ----------Loading Model----------                       

                      ----------Adding Hydrogens----------                     

[('0.000', '0.000', '0.000')]
('0.000', '0.000', '0.000')
[('0.000', '0.000', '0.000')]
('0.000', '0.000', '0.000')
[('0.000', '0.000', '0.000')]
('0.000', '0.000', '0.000')
[('0.000', '0.000', '0.000')]
('0.000', '0.000', '0.000')
Number of hydrogen atoms added to the input model: 2216 

                         ----------Optimizing----------                        

Boost.Python.ArgumentError: Python argument types in
    None.None(ExtraAtomInfo, NoneType)
did not match C++ signature:
    None(molprobity::probe::ExtraAtomInfo {lvalue}, double)

During handling of the above exception, another exception occurred:

Sorry: Could not find atom info for A CL 1 CL (perhaps interpretation was not run on the model?)

This is how far I can get at the moment (..but again I don't mind deleting these ions as long as they're not structural ;)

cctbx / cctbx_project

Questions about using reduce2 as a standalone program #1011

1. Handling of Ions

2. Unexpected Translation of Coordinates