biocryst / biozernike

Protein structure descriptors and alignment based on 3D Zernike moments.
MIT License
33 stars 11 forks source link

Calculating moments between two PDBs #20

Open thorstenwagner opened 1 year ago

thorstenwagner commented 1 year ago

Hi!

Thanks for the lib :-)

I would like to calculate a similarity score between two PDBs (like rcsb but for any protein combination) . Is there any documentation how to do that with this library?

Best, Thorsten

biocryst commented 12 months ago

Hi Thorsten,

The simplest way is to use InvariantNorm.compareInvariants, as in the example here: https://github.com/biocryst/biozernike/blob/bb21b6f59fce4ca82276d6d601df7ef38f74b849/src/test/java/org/rcsb/biozernike/DescriptorTest.java#L163C35-L163C35

It compares moments between a coordinate structure and an EM one, but comparing two PDB structures would be the same. You will get a distance score that will negatively correlate with the rcsb's similarity score. The correlation will not be perfect though, as that one is a weighted distance, with weights trained on a set of assemblies.

Sample weights can be found here https://github.com/biocryst/biozernike/blob/master/src/test/resources/descriptor.properties (these were trained on CATH subsets if I remember correctly). This file is used to initialise the Descriptor class, but I just realised that we're not really showing how to use the Descriptor class in this library beyond initialisation.

Basically using all moments (as in compareInvariants) is a decent starting point, but you can tune the distance measure by fitting the moment weights to maximise retrieval metrics on your dataset, if needed.

thorstenwagner commented 12 months ago

Thank you for your response :-)

So, that means I could also use a molmap (mrc) to calculate the similarity? That would be even better, because I actually want to know how similar two molmaps (10A) are which I converted from PDB.

I think this code from your test would read the map?

InputStream is = DescriptorTest.class.getResourceAsStream("/emd_3186.map");
        Volume volumeEM = VolumeIO.read(is, MapFileType.MRC);
        volumeEM.positivize();
        volumeEM.applyContourAndNormalize(0.0176, 757);
        volumeEM.updateCenter();
        volumeEM.setRadiusVarMult(1.64);

There are some constants you set. Can you comment on setRadiusVarMult?

biocryst commented 12 months ago

The descriptors are based on volumes (maps), so yeah you could use maps directly. There are a few things to keep in mind, resulting from the fact that my main workflow was coordinates->simulated density->descriptors

If the maps you are comparing are coming from the PDB coordinates, as you mentioned, then I guess you generated them in a similar manner (+/- how the Volume::create(..) does it), then you don't need to worry about different normalisations and radii corrections. But for arbitrary EM maps the consistency of the densities needs to be validated first before calculating/comparing descriptors.