MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.26k stars 641 forks source link

Implement lDDT calculation #4134

Open apayne97 opened 1 year ago

apayne97 commented 1 year ago

Is your feature request related to a problem?

I'd like to directly compare AlphaFold predicted structures, datasets of PDBs and trajectories. LDDT https://academic.oup.com/bioinformatics/article/29/21/2722/195896?login=true#92198107 seems like a good tool to do this and seems possible to implement, although it will potentially scale poorly with increasing protein size.

Describe the solution you'd like

I'd like to be able to run LDDT calculations!

Describe alternatives you've considered

I'm going to ask mdtraj to add this feature as well :P Also, SwissModel has a tool for it https://swissmodel.expasy.org/lddt And I guess it is potentially implemented with OpenStructure although I haven't tested it. https://www.openstructure.org/docs/2.4/mol/alg/lddt/

Additional context

IAlibay commented 1 year ago

@apayne97 - I'm just having a super quick look through here, is IDDT just an an all-atoms self-distance calculation w/ a threshold? Sorry for the silly question - I just can't seem to see an equation in the reference paper you link to.

apayne97 commented 1 year ago

It's a bit maddening that they didn't publish equations in this paper.

I think for a two model comparison of reference A to model B:

  1. For models A and B, calculate pairwise distances D_i,j A) smaller than a certain threshold (R_0) and B) where i_residue != j_residue
  2. For each pairwise distance in the union of both distance vectors, if the distance is the same (within a certain tolerance) in both models, it is considered preserved, and the fraction F of the conserved distances is calculated.
  3. The published LDDT score is the average of F for tolerances of 0.5Å, 1Å, 2Å, 4Å.

In the paper they show some results suggesting an optimal choice of R_0 of 15Å.

You can download the code (binaries?) at the swissmodel site. But I'm not sure if that will help or not.

IAlibay commented 1 year ago

Thanks! That sounds like quite a fun thing to implement, either here or as an mdakit.

It might also make sense to directly interface with openstructure, would need to double check their license / input style.

apayne97 commented 1 year ago

Are you connected with them at all? I am not.

I think I'd be happy to take a stab at implementing but have no confidence in doing it the fastest way.

apayne97 commented 1 year ago

Looks like there might be an implementation from the Baker Lab: https://github.com/hiranumn/DeepAccNet/blob/21c1775f841bea685b17c688a64da733e5d55cf7/deepAccNet_noPyRosetta/model.py

https://github.com/hiranumn/DeepAccNet/blob/21c1775f841bea685b17c688a64da733e5d55cf7/deepAccNet/model.py

sukritsingh commented 1 year ago

Here's a link to Swiss-Model's source code for how they run lDDT: https://www.openstructure.org/docs/2.4/install/

IAlibay commented 1 year ago

Are you connected with them at all? I am not.

No, I don't believe anyone in the MDAnalysis coredevs is.

Sorry here by "interface" I meant that one could feasibly just make a "converter" that takes an MDAnalysis AtomGroup and uses their Python API to run the actual calculation.

That being said, doing it in MDAnalysis purely also shouldn't be too big a deal either by looping over your residues and using MDAnalysis.lib.distances.capped_distance.

I think I'd be happy to take a stab at implementing but have no confidence in doing it the fastest way.

That's definitely not an issue! So here are two potential options for you @apayne97:

  1. Implementing this as part of a "protein analysis" MDAKit. It's something I've I actually wanted to create for some time as a palce to have things like "structure checks", "more useful" contact analysis codes, and also maybe move some of the bioinformatics-y methods out of the core library. If that's something that is of interest to you, I'd be happy to work towards making this happen alongside @ianmkenney, @fiona-naughton and any other interested party. Once that's all up and running we can start by adding it to the MDAKits repo (https://mdakits.mdanalysis.org/), published in JOSS and eventually even placed back into the core library if it gets lots of use.

  2. You could open a PR against the core library and then we could look to see if it can be directly added here directly. That's a bit of a longer process, and would depend on if the other @MDAnalysis/coredevs think it would be a good fit.

orbeckst commented 1 year ago

Looks useful!

apayne97 commented 1 year ago

Sorry here by "interface" I meant that one could feasibly just make a "converter" that takes an MDAnalysis AtomGroup and uses their Python API to run the actual calculation.

Ohhh. Got it. Makes sense!

Implementing this as part of a "protein analysis" MDAKit. It's something I've I actually wanted to create for some time as a palce to have things like "structure checks", "more useful" contact analysis codes, and also maybe move some of the bioinformatics-y methods out of the core library.

Love this idea! Would be happy to contribute!

You could open a PR against the core library and then we could look to see if it can be directly added here directly. That's a bit of a longer process, and would depend on if the other @MDAnalysis/coredevs think it would be a good fit.

I guess either way I (+ other helpful Chodera Lab folks) can write up a proof of principle and we can consider adding to repo

apayne97 commented 1 year ago

One thing I'm not sure how to standardize (and would like to dig into the other implementations) is the alignment for the two proteins. I'm thinking we'd want users to submit a mapping of residues so that we don't have to do our own sequence alignment / etc. I think this is what the OpenStructure implementation does: https://git.scicore.unibas.ch/schwede/openstructure/-/blob/master/modules/mol/alg/pymod/lddt.py

From https://www.openstructure.org/docs/2.4/mol/alg/lddt/

One can replicate the binary using simple python script:

#! /bin/env python
"""Run lDDT from within script."""
from ost.io import LoadPDB
from ost.mol.alg import (CleanlDDTReferences,
                         PreparelDDTGlobalRDMap,
                         lDDTSettings,
                         CheckStructure,
                         LocalDistDiffTest,
                         GetlDDTPerResidueStats,
                         PrintlDDTPerResidueStats,
                         ResidueNamesMatch)
from ost.io import ReadStereoChemicalPropsFile

model_path = "Path to your model pdb file"
reference_path = "Path to your reference pdb file"
structural_checks = True
bond_tolerance = 12
angle_tolerance = 12
cutoffs = [0.5, 1.0, 2.0, 4.0]
#
# Load model and prepare its view
model = LoadPDB(model_path)
model_view = model.GetChainList()[0].Select("peptide=true")
#
# Prepare references - it should be alist of EntityView(s)
references = [LoadPDB(reference_path).CreateFullView()]
#
# Initialize settings with default parameters and print them
settings = lDDTSettings()
settings.PrintParameters()

#
# Clean up references
CleanlDDTReferences(references)
#
# Prepare residue map from references
rdmap = PreparelDDTGlobalRDMap(references,
                               cutoffs=cutoffs,
                               sequence_separation=settings.sequence_separation,
                               radius=settings.radius)
#
# This part is optional and it depends on our settings parameter
if structural_checks:
    stereochemical_parameters = ReadStereoChemicalPropsFile()
    CheckStructure(ent=model_view,
                   bond_table=stereochemical_parameters.bond_table,
                   angle_table=stereochemical_parameters.angle_table,
                   nonbonded_table=stereochemical_parameters.nonbonded_table,
                   bond_tolerance=bond_tolerance,
                   angle_tolerance=angle_tolerance)
#
# Check consistency
is_cons = ResidueNamesMatch(model_view, references[0], True)
print("Consistency check: ", "OK" if is_cons else "ERROR")
#
# Calculate lDDT
LocalDistDiffTest(model_view,
                  references,
                  rdmap,
                  settings)
#
# Get the local scores
local_scores = GetlDDTPerResidueStats(model_view,
                                      rdmap,
                                      structural_checks,
                                      settings.label)
#
# Pring local scores
PrintlDDTPerResidueStats(local_scores, structural_checks, len(cutoffs))
dirkwalther commented 11 months ago

Dear apayne97, I fully agree. Unbelievable that the original paper does not contain a rigorous mathematical definition. Irritating to realize that peer-review did not lead to it either. From reading it I have come to the same interpretation as you though. Best, Dirk

Ieremie commented 10 months ago

Alphafold source code contains the implementation for the LDDT measure!

apayne97 commented 9 months ago

From reading it I have come to the same interpretation as you though. Best, Dirk

Thanks for double-checking @dirkwalther !

Alphafold source code contains the implementation for the LDDT measure!

@Ieremie could you point me to this? weirdly I have been having trouble finding it. thanks!

Ieremie commented 9 months ago

@apayne97

lddt-implementation

This is how I use it, maybe it is of help.


target_struct = parser.get_structure("a", f"{root}structures/{prot_id}.pdb")[0]
target_coord = np.array([res["CA"].get_coord() for res in target_struct.get_residues()])
target_coord = target_coord[None, :, :]
# we want to look at the entire target
true_points = np.ones((1,target_coord.shape[1], 1))

# masking residues that have not been structurally determined but appear in the seq
seq_from_struct = surface_util.get_seq_from_struct(target_struct)
mask = surface_util.get_alignment_mask(fasta_dict[prot_id].seq, ''.join(seq_from_struct))

pred_struct = parser.get_structure("a", pred_struct_name)[0]
pred_coord = np.array([res["CA"].get_coord() for res in pred_struct.get_residues()])

# select only the predicted structure for which we have target data
pred_coord = pred_coord[mask]
pred_coord = pred_coord[None, :, :]

LDDTs.append(np.array(lddt(pred_coord, target_coord, true_points))[0] * 100)```
jomimc commented 9 months ago

Hi there, long-time lurker here. (hi to Irfan and Richard if you see this)

I am currently putting together a protein structure analysis package -- protein deformation analysis -- I have of course shamelessly copied the package name from MDAnalysis.

I was wondering if it would be useful to add this functionality to the MDAKit? It'd be great if I could get my student to work on it.

orbeckst commented 9 months ago

Hi @jomimc , PDAnalysis looks very interesting. As it is, it unfortunately does not meet the criteria for inclusion in the MDAKit registry because your tool does not use MDAnalysis, at least judging from your requirements.txt/setup.py. (Maybe in the future you consider replacing the Biopython parser with MDAnalysis so that your package can work with more formats and perhaps even trajectories?)

If you want to make the IDDT function (and others) available as a separate MDAKit then that would be fantastic!! The MDAKits website contains a tutorial for how to get started.

jomimc commented 9 months ago

Hi @orbeckst, PDAnalysis is written mainly for analyzing AlphaFold and Protein Data Bank structures, hence using Biopython which is more oriented towards that. However I'm currently using MDAnalysis for analyzing protein trajectories, and I can definitely see a use for our new deformation metrics (and LDDT) there. In particular, the reason for using local deformation metrics is that RMSD and RMSF use global structural alignments, which do not work well with large-scale conformational change that is typical for some proteins over long timescales.

I'll have a look at the MDAKit tutorial and see how much work is involved. Thanks.

apayne97 commented 9 months ago

@Ieremie

this is fantastic, thank you!