DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0
32 stars 12 forks source link

Bug: Functional group is unknown to the forcefield #582

Closed lloydtripp closed 5 months ago

lloydtripp commented 6 months ago

Describe the bug RD2 is reporting a lot of errors for different functional groups from the PDBs I'm attempting to process. The PDBs are generated by RoseTTAFold2 so I didn't expect to run into issues here. Here's one example of the error I'm getting:

Atom contact_atoms_SGCB_V200E A 131 1HB is unknown to the forcefield; vanderwaals_parameters set to (0.0, 0.0, 0.0, 0.0)

Environment:

To Reproduce Steps/commands/screenshots to reproduce the behaviour:

  1. Here's an example PDB SGCB_V200E.zip

  2. Code to produce issue (modified from the PPI tutorial)

queries = QueryCollection()

influence_radius = 8  # max distance in Å between two interacting residues/atoms of two proteins
max_edge_length = 8

print(f"Adding {len(variant_scoring_tbl)} queries to the query collection ...")
count = 0
for i in range(1): #range(len(variant_scoring_tbl)): # Testing for just one PDB but rest fail as well.
    queries.add(
        ProteinProteinInterfaceQuery(
            pdb_path=variant_scoring_tbl.at[i,"pdb_filepath"],
            resolution="residue",
            chain_ids=["A", "C"],
            influence_radius=influence_radius,
            max_edge_length=max_edge_length,
            targets={
                "DMS Score": variant_scoring_tbl.at[i, "DMSscore"]  # DMS Score for the variant
            },
        )
    )
    count += 1

queries.process(
    prefix=analysis_residue_dir,
    feature_modules=[components, contact],
    combine_output=True,
)

Expected Results I'm expecting graph representations of my protein structures.

Actual Results or Error Info Error output from RD2 RD2 Error Output.txt Residue HDF5 residue.zip

Other Comments Seems similar to the problems reported here that were supposedly fixed #380

lloydtripp commented 6 months ago

I switched over to doing SingleResidueVariantQuery and running into the same problem. It looks like the HDF5 but I haven't yet looked to see if it's reasonable.

Here's my code now (modified from SRV tutorial):

queries = QueryCollection()

influence_radius = 20.0  # radius to select the local neighborhood around the SRV
max_edge_length = 10 # Rounded up to the nearest integer. Before was 4.5 
variant_chain_id = "A"

print(f"Adding {len(pdb_files)} queries to the query collection ...")
count = 0
for i in range(1): # range(len(pdb_files)):
    queries.add(
        SingleResidueVariantQuery(
            pdb_path=variant_scoring_tbl.at[i,"pdb_filepath"],
            resolution="residue",
            chain_ids=variant_chain_id,
            variant_residue_number=variant_scoring_tbl.at[i,"Position"],
            insertion_code=None,
            wildtype_amino_acid=amino_acids_by_letter[variant_scoring_tbl.at[i, "Original Amino Acid"]],
            variant_amino_acid=amino_acids_by_letter[variant_scoring_tbl.at[i, "Mutated Amino Acid"]],
            targets={"DMSscore": variant_scoring_tbl.at[i, "DMSscore"]},
            influence_radius=influence_radius,
            max_edge_length=max_edge_length,
        )
    )
    count += 1
    if count % 100 == 0:
        print(f"{count} queries added to the collection.")

print(f"Queries ready to be processed.\n")

#grid_settings = GridSettings( None) # I don't want grid settings

#grid_map_method = None #I don't want grids

queries.process(
    prefix=os.path.join(hdf5_dir, "variant"),
    feature_modules=[components, contact],
    # Commenting out to utilize all the cpu. Was cpu_count=8,
    combine_output=False
    )

print(
    f'The queries processing is done. The generated HDF5 files are in {hdf5_dir}.'
)

Here's the HDF5 output: variant-78208.zip

Side note, I'm having issues running processing with multiple PDBs (multiple query.add executions). When I'm only adding one query.add then process then it's fine. The process command will never completes (runtime is wayyy longer than expected based on one structure alone).

gcroci2 commented 6 months ago

Hi @lloydtripp, thanks for opening the issue! I reproduced your steps and the code gives me the same messages, which are warnings for the user, not errors. They mean that for the forcefield currently used in the package, some of the atoms listed in your PDB files are not known, thus the charge or the vanderwaals parameters are set to 0 in such cases. But the generated features saved in the HDF5 file are correct - according to the forcefield that we are using.

Here I plotted a histogram for the two "problematic" features, i.e., electrostatic and vanderwalls:

energies_hist

The electrostatic one makes sense to me. But I find it a bit odd that the vanderwalls feature has such high values. Any ideas whether this could make sense? @DaniBodor @cbaakman

cbaakman commented 6 months ago

Like the warning says:

Atom contact_atoms_SGCB_V200E A 131 1HB is unknown to the forcefield; vanderwaals_parameters set to (0.0, 0.0, 0.0, 0.0)

Unknown atoms get zero vanderwaals parameters, thus the final lennard jones potential becomes zero:

eps = sqrt(eps1 *  eps2)
LJ = 4 * eps * ((sigma, r) ** 12 - (sigma, r) ** 6) 

That's what we see in the plot made by @gcroci2

We must set all involved atoms to nonzero values. First, we need a list of all the unknown atom names that RoseTTAFold2 generates. This is hopefully just the hydrogens.

@LilySnow @NicoRenaud any idea where the parameters in this forcefield originally came from?

gcroci2 commented 5 months ago

Like the warning says:

Atom contact_atoms_SGCB_V200E A 131 1HB is unknown to the forcefield; vanderwaals_parameters set to (0.0, 0.0, 0.0, 0.0)

Unknown atoms get zero vanderwaals parameters, thus the final lennard jones potential becomes zero:

eps = sqrt(eps1 *  eps2)
LJ = 4 * eps * ((sigma, r) ** 12 - (sigma, r) ** 6) 

That's what we see in the plot made by @gcroci2

Indeed, but I'd expect an all-zeros feature for vanderwaalls, and instead I see that the mean is something around 10^11, with a very high standard deviation as well. @cbaakman

cbaakman commented 5 months ago

OK sorry, I didn't notice the small letters below.

I think those high vanderwaals values can be explained by the quality of the structure. When I load it into pymol, I see atoms sitting too close together:

pymol

gcroci2 commented 5 months ago

OK sorry, I didn't notice the small letters below.

I think those high vanderwaals values can be explained by the quality of the structure. When I load it into pymol, I see many atoms sitting too close together:

pymol

Can this suggest that the structure is somehow wrong? @cbaakman

cbaakman commented 5 months ago

I think so yes, but it's on the edge. Because there are also good parts in it. If you want to see for yourself, use pymol or YASARA to view the structure. Compare it with a PDB entry, to see what a protein structure normally looks like. https://www.rcsb.org/structure/101M

gcroci2 commented 5 months ago

To put our thoughts together, even if the energy calculation can be improved (and we're currently discussing that in #393), the code works fine from our side. The high vanderwaals values we get seem to be due to an issue related to the structure itself, and not to the features generation. @lloydtripp I suggest you investigate such parts and ask directly in RoseTTAFold2 repository.

DaniBodor commented 5 months ago

@gcroci2, I think @lloydtripp's problem wasn't so much the large vdw values (this is something you flagged), but the warning's about them being set to 0.* (Although I do agree that this is something that you may want to investigate to be sure it's correct)

@lloydtripp , as I don't think you are getting any actual errors, merely warnings (at least that's what the output tells me), there is nothing that is formally stopping you from proceeding with your analysis. The warnings can be explained by the fact that the code currently isn't dealing with hydrogens very well, which is something we are in the process of resolving (the issue that you reference where you say it was supposedly fixed actually points to another issue that is still open and was merely closed to not have duplicate issues). You have a few options of how to proceed:

  1. ignore the warnings and proceed with the hydrogens having 0 energy
  2. you can test this package to preprocess the pdb files. I am not positive this will resolve all the warnings, though
  3. scrape the hydrogens from your pdb files altogether (I believe there are tools for this as well).

Hope this discussion is useful to you and please don't hesitate to ask if anything we mention here is unclear.

lloydtripp commented 5 months ago

Thanks for the comments everybody!

I'll take a look at the protein folds from RoseTTaFold2. I haven't been manually inspecting them for issues so thank you for bringing this to my attention.

In terms of data processing within Deeprank2, I'll ignore the warnings for now.

LilySnow commented 5 months ago

This issue is caused by the force field used by DeepRank2. It is using OPLS. So probably it cannot recognize the atoms in your pdb file...

gcroci2 commented 5 months ago

This issue is caused by the force field used by DeepRank2. It is using OPLS. So probably it cannot recognize the atoms in your pdb file...

Indeed, we'll add this info to the docs (see #593)