Open arogozhnikov opened 2 months ago
@aozalevsky @brindakv recommendations are very welcome
afaik, direct support of _ma_qa_metric
is implemented in Mol* (which you already know since you're using it in your web service) and ChimeraX. In Mol* pLDDT is automatically recognized as one of the Validation
attributes and colored according to pLDDT Confidence
colorscheme. In ChimeraX you can color a model using the following command:
color by pLDDT_score atoms palette alphafold
where pLDDT
comes from a _ma_qa_metric
and _score
is added later in the code as a label.
All scripts for pLDDT coloring for PyMOL rely on the _atom_site.B_iso_or_equiv
(or simply B-factors) field since it's been a common (though hacky) practice. DeepMind is also populating both _ma_qa_metric
and _atom_site.B_iso_or_equiv
for backwards compatibility. Seems that you also opted for a similar strategy.
As for PyMOL, seems like only the minimal useful information from PDBx/mmCIF
and BinaryCIF
is extracted. I wonder if @JarrettSJohnson would be willing to chime in. Would be really nice to have at least per-residue (local) scores from _ma_qa_metric
supported in all major visualization packages.
As for PAE, it's already supported in the ModelCIF as _ma_qa_metric_local_pairwise
, but judging from the repos, none of the aforementioned tools support reading it from the CIF file. ChimeraX already has PAE visualization (as downloaded from AFDB) tools in place. Maybe @tomgoddard can tell if the current _ma_qa_metric
metric support can be further extended for reading and visualizing pairwise (PAE) scores.
The low-level support for ModelCIF (including reading/writing metrics) is already implemented in the https://github.com/ihmwg/python-modelcif
Showing _ma_qa_metric_local_pairwise in ChimeraX is not related to the capability added by @e-pettersen in April 2023 to show per-residue attributes from _ma_qa_metric since a residue-pair score would be handled entirely differently than a score on individual residues. But I have used the AlphaFold PAE plot to show residue-residue distances as shown here
https://rbvi.github.io/chimerax-recipes/rrdist/rrdist.html
That code just rewrote the distances in AlphaFold PAE JSON file format and read the file. I could possibly try some Python code in ChimeraX that reads the _ma_qa_metric_local_pairwise table from a ModelCIF entry and show it with the AlphaFold PAE plot in ChimeraX. Could you point me to a ModelCIF file that has that table? The dictionary description of the table suggests it has all the information needed to display like PAE data.
@arogozhnikov @tomgoddard i took one of the examples from the Chai-1 webserver and generated a CIF file with embedded PAEs.
There are a few comments, though:
my example has only protein-protein PAEs. The reason is that models.qa_metric.LocalPairwise
is defined on the per-residue level, but as @arogozhnikov pointed out, they use per-atom PAEs for ligands and non-canonical amino acids/nucleotides. I've opened the issue https://github.com/ihmwg/python-modelcif/issues/38
there are no dictionary requirements/definitions related to the format of PAEs. Technically, a PAE matrix should be a symmetric square matrix. Still, at the moment, there are no guidelines on whether the full matrix or only the upper triangular part has to be populated. It also seems that only selected scores could populated (as in my example - i ignored protein-ligand PAEs). To continue the discussion I've created https://github.com/ihmwg/ModelCIF/issues/19 I
The ChimeraX PAE plotting for AlphaFold 2 and AlphaFold 3 assumes that the matrix rows and columns include all residues in order and for AlphaFold 3 all atoms for all non-standard residues. I believe in AlphaFold 3 the PAE files contain residues numbers, chain ids, not sure about atom names, for each PAE row, but ChimeraX is not using those because the ChimeraX code was written for early AlphaFold 2 PAE files that had none of that information labeling the matrix rows. Also the PAE matrix is not symmetric in AlphaFold since the PAE value is defined by aligning the row residue/atom and then giving an error for the placement of the column residue/atom, so it is by definition not symmetric. So ChimeraX has no provision to handle just one triangle of a symmetric matrix. All these ChimeraX PAE limitations could be fixed for ModelCIF. My purpose in this comment is just to explain how ChimeraX PAE plotting is currently based entirely on AlphaFold (and ESMFold) PAE conventions.
First of all let me state that it is great to see the interest in properly storing the quality metrics using ModelCIF. Given my involvement in ModelCIF and ModelArchive (MA), I can hopefully provide some useful input here.
_ma_qa_metric.type
and _ma_qa_metric.mode
and new categories for per-chain, per-chain-pair, per-atom, and per-atom-pair scores. For the latter, I am uncertain how to best handle per-residue tokens and would be happy for your input. My suggestion would be to simply only use the "token centre atom" (as in 2.6 of Supplementary information for AlphaFold 3; CA for standard amino acids, C1' for standard nucleotides) for per-residue tokens and hence only sparsely populate the per-atom-pair scores table. The alternative would require a mix of per-atom-pair and per-residue-pair data which I think would be more confusing I think.ChimeraX plots the AlphaFold 3 PAE scores and handles its combination of residue and atom level values (one PAE value per residue for standard amino and nucleic acids, and one PAE value per atom for all other atoms). Specifically it knows that the per-residue score is a residue score, not just a score for the C-alpha atom. If you treat it as a C-alpha atom score you lose some knowledge about the meaning of the score. I agree it is messy to handle though.
I made an example ChimeraX command to plot pairwise residue scores included directly in ModelCIF structure files described here
https://rbvi.github.io/chimerax-recipes/modelcif_pae/modelcif_pae.html
Yeah the "meaning of AlphaFold 3 PAE scores" is a tricky one here. In terms of what is being predicted, it relates only to the CA/C1' atoms for standard amino and nucleic acids. But: I could make the same argument about per-residue pLDDT and PAE in AF2. In terms of interpretation of the score (i.e. making it useful), I of course agree that PAE values for standard amino and nucleic acids relate to the whole residue and not just to CA/C1' atoms.
Anyway I updated my suggested update of ModelCIF (https://github.com/ihmwg/ModelCIF/issues/21) with an idea how one could represent QE metrics which are listed in separate ModelCIF tables. I.e. the same PAE metric would then have values in _ma_qa_metric_per_atom_pairwise
and _ma_qa_metric_local_pairwise
. From the perspective of the PAE writers and parsers this option would then also mean that data needs to be written in separate tables in separate ways for per-residue-pair and per-atom-pair PAEs.
A simple idea for making the mmCIF tables handle pairwise scores between atoms and residues is to add two atom name fields to the _ma_qa_metric_local_pairwise table
https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Categories/ma_qa_metric_local_pairwise.html
If the atom name was given as "." then that would mean the score applies to the whole residue, while if an atom is named e.g. "CA" then the score is associated with that atom. This allows a pair score to be between an atom and a residue as it is done in AlphaFold 3. I think it will be important to handle AlphaFold 3 PAE scores. They will probably be the most heavily used pairwise scores in the next few years.
If the atom name was given as "." then that would mean the score applies to the whole residue, while if an atom is named e.g. "CA" then the score is associated with that atom. This allows a pair score to be between an atom and a residue as it is done in AlphaFold 3. I think it will be important to handle AlphaFold 3 PAE scores. They will probably be the most heavily used pairwise scores in the next few years.
we have a somewhat similar concept in ihm for crosslinks (which are also, typically, are pairwise restraints) _ihm_cross_link_restraint.model_granularity
. Theoretically, similar modifiers could be applied for both ends of a restraint/score. However, maintaining the by-feature
level (basically any custom selection other than standard things like an atom, full residue, or a full asym_unit) is tricky and requires multiple additional tables.
Agree with the above. Supporting AF3 PAE will be important and we need to make sure that whatever we implement can handle pairs between single atoms and full residues. I will update the suggestion in https://github.com/ihmwg/ModelCIF/issues/21 accordingly. One worry of adding more columns to ma_qa_metric_local_pairwise
is that it makes it more bulky while a pure per-atom table could be linked to atom_site.id
instead. Additionally the current per-residue identification would not be suitable to handle QE for water molecules (unless we deviate from PDB standards and give each water molecule a separate label_asym_id
) which are relevant for physics-based docking tools.
We recently switched from PDB > CIF with two main motivations:
We followed recommendation to use
modelcif
, and specifically_ma_qa_metric
, but I don't see a way to color by this field in pymol; and don't see pairwise metrics in pymol either.I need help figuring out what's the right pipeline for users.