Compare deeprank-core and deeprank features

gcroci2 commented 1 year ago

Features that we currently have in deeprank-core, with the nomenclature in the hdf5 files.

Edge features (edge_features Group in the hdf5 files):
- metafeatures: "_name", "_index"
- generic features: "same_chain", "same_res", "distance"
- interactions: "covalent", "electrostatic", "vanderwaals"
Node features (node_features Group in the hdf5 files):
- metafeatures: "_name", "_chain_id", "_position"
- residue core features: "res_type", "charge", "polarity", "res_size", "res_mass", "res_pI", "hb_donors", "hb_acceptors"
- variant residue features: "variant_res", "diff_charge", "diff_size", "diff_mass", "diff_pI", "diff_polarity", "diff_hb_donors", "diff_hb_acceptors", "diff_conservation"
- protein context features: "bsa", "hse", "sasa", "res_depth"
- conservation features: "pssm", "info_content", "conservation"
- atom core features: "atom_type", "atom_charge", "pdb_occupancy", "vdw_parameters"

Questions:

Which ones are not needed for grids?
Which ones are missing from the original deeprank for grids?

LilySnow commented 1 year ago

I did not find the feature list in the Nat. Comm. paper... The feature list was printed out in the log file. The log file was not backed up. So the easiest way I think is to run the example of DeepRank-CNN for "A . Generate the data set (using MPI)": https://github.com/DeepRank/deeprank. Make sure to set "compute_features='all' "

cbaakman commented 1 year ago

I remember that the Deeprank-CNN has the following features:

electrostatic
vanderwaals
charge
bsa
pssm, info_content, conservation
atomic densities per atom-type

Possibly others

cbaakman commented 1 year ago

I recently noticed that the original deeprank stores features per chain. Examples:

vdwaals_chain1
vdwaals_chain2

@LilySnow Does this need to be done like this in deeprank-core as well?

DaniBodor commented 1 year ago

> I did not find the feature list in the Nat. Comm. paper... The feature list was printed out in the log file. The log file was not backed up. So the easiest way I think is to run the example of DeepRank-CNN for "A . Generate the data set (using MPI)": https://github.com/DeepRank/deeprank. Make sure to set "compute_features='all' "

I tried running this script, but compute_features='all' is not a valid input. Instead, I used all the feature module names I found in the features folder, assuming there were no other feature generating modules anywhere else.

compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.BSA',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.ResidueDensity']

I uploaded the hdf5 file to this branch: https://github.com/DeepRank/deeprank-core/tree/279_compare_cnn_features_dbodor/tests/data/hdf5

The following features are present in the hdf5 file:

bsa, charge, coulomb, pssm_ic, vdwaals
- all of these also exist in deeprankcore
PSSM (x20)
- 1 per amino acid
- in deeprankcore we have a single PSSM feature that contains all of these in one-hot encoding
RCD (x7)
- total + 6 combinations of polar/apolar/charged
- I believe RCD stands for residue contact density
- I don't know whether we have an equivalent of this in deeprankcore (see features listed by Giulia above)
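To make the PSSM difference concrete, here is a minimal sketch of how 20 per-amino-acid arrays could be stacked into a single per-residue feature. The dictionary keys, the amino-acid ordering, and the `stack_pssm` helper are assumptions for illustration; neither package is known to use them.

```python
import numpy as np

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def stack_pssm(per_aa):
    """Stack 20 arrays of length n_residues into one (n_residues, 20) array."""
    columns = [np.asarray(per_aa[aa], dtype=float) for aa in AMINO_ACIDS]
    return np.stack(columns, axis=1)

# Dummy values for two residues; real PSSM scores would come from the hdf5 file.
per_aa = {aa: [float(i), float(i) + 0.5] for i, aa in enumerate(AMINO_ACIDS)}
pssm = stack_pssm(per_aa)
```

Each row of `pssm` then holds the 20 scores of one residue, which is the layout a single combined PSSM feature would need.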

In addition, there is a separate group called "features_raw", which contains features with the same names as those in the "features" group, but with different values and shapes.

I have since found out that these are "human readable versions" of the features.

Finally, there is a group called "mapped_features", which has all above mentioned features as well, but in duplicate, once per chain. In addition, mapped features has an "AtomicDensities" group with 8 subgroups: C, N, O, and S for each chain.

I don't think we have an equivalent for this in deeprankcore. Is it needed?

Finally, note that I did not look at the content of the features, just at their presence/absence.
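A presence/absence check like the one above could be scripted roughly as follows. This is only a sketch: the nested dictionaries are stand-ins that reuse a few names from this thread, and `iter_leaf_paths` and `feature_names` are hypothetical helpers. Because `h5py.File` and `h5py.Group` objects also expose `items()`, the same walk should work on an open hdf5 file.

```python
from typing import Any, Iterator, Set

def iter_leaf_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield slash-separated paths to every leaf (dataset-like) entry below node."""
    for name, child in node.items():
        path = f"{prefix}/{name}" if prefix else name
        if hasattr(child, "items"):  # group-like: descend further
            yield from iter_leaf_paths(child, path)
        else:
            yield path

def feature_names(node: Any) -> Set[str]:
    """Collect the last path component of every leaf."""
    return {path.rsplit("/", 1)[-1] for path in iter_leaf_paths(node)}

# Stand-in hierarchies (not real file contents).
old_file = {"features": {"bsa": 0, "charge": 0, "coulomb": 0, "pssm_ic": 0, "vdwaals": 0}}
new_file = {
    "node_features": {"bsa": 0, "charge": 0},
    "edge_features": {"electrostatic": 0, "vanderwaals": 0},
}

only_in_old = sorted(feature_names(old_file) - feature_names(new_file))
```

With these stand-ins, `only_in_old` contains "coulomb", "pssm_ic" and "vdwaals", which shows that a purely name-based comparison also flags features that exist under a different name, so the names still need to be matched by hand.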

DaniBodor commented 1 year ago

@cbaakman and @LilySnow, could you please take a look at my comment above and let me know:

[x] Did I catch all the features using the 5 modules I listed above, or are there others I missed?
- I did a search for FeatureClass and didn't find it anywhere else, so I am fairly confident that I caught everything.
[x] Re: RCD features
- ~~what are these precisely~~
- ~~do we have an equivalent of them in the current deeprankcore (see top comment by Giulia for full list)~~
- ~~if not, do we need them in deeprankcore as well?~~
- I opened an issue (#331) to add these
[x] Re: AtomicDensities group in mapped_features
- ~~what are these precisely~~
- are these covered by "atom_type" features of deeprankcore (see comments below)?
- if not, how do we implement them in deeprankcore?
[x] Is the difference between "features" and "features_raw" something we need to implement in deeprankcore?
- ~~If so, please explain what these are~~
- the raw features are "human readable" equivalents, so not needed
[ ] Do we need to do a careful comparison of the content/shape of the different features in addition to just looking at their existence?
[ ] Any other comment on this or something I missed?

cbaakman commented 1 year ago

I don't know the answers to most of these questions. But I always understood that features_raw was no longer used. Also, atomic densities seem to be a feature that should be available, but it can not be mapped to the grid with the current deeprank-core code.

DaniBodor commented 1 year ago

Looking further at the RCD features in the ResidueDensity.py module of deeprank (CNN), it appears that this feature is just a count of the number of other residues within a certain cutoff radius; either the total number or the number of a certain polarity.

It shouldn't be too difficult to add a feature to deeprankcore to count these as well. I'll open an issue for that (#331).
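A rough sketch of such a count, assuming one representative coordinate per residue and a polarity label per residue. The function name, the labels, and the cutoff are hypothetical, and this version counts neighbours per class rather than reproducing the six class combinations of the original RCD features.

```python
import numpy as np

def residue_contact_counts(positions, classes, cutoff):
    """Count, for each residue, the other residues closer than cutoff.

    Returns a dict with the total count and one count per neighbour class.
    """
    positions = np.asarray(positions, dtype=float)
    classes = np.asarray(classes)
    # Pairwise Euclidean distances between all residues.
    diff = positions[:, None, :] - positions[None, :, :]
    within = np.linalg.norm(diff, axis=-1) < cutoff
    np.fill_diagonal(within, False)  # a residue is not its own neighbour
    counts = {"total": within.sum(axis=1)}
    for label in np.unique(classes):
        # Only keep neighbours (columns) that belong to this class.
        counts[str(label)] = (within & (classes == label)[None, :]).sum(axis=1)
    return counts

# Three dummy residues on a line; only the first two are within 2.0 of each other.
counts = residue_contact_counts(
    positions=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]],
    classes=["polar", "apolar", "charged"],
    cutoff=2.0,
)
```

The all-pairs distance matrix keeps the sketch short; for large structures a neighbour search would be a more sensible choice.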

DaniBodor commented 1 year ago

> Also, atomic densities seem to be a feature that should be available, but it can not be mapped to the grid with the current deeprank-core code.

If I understood @DarioMarzella correctly, these represent the proportion of the voxel that is occupied by each atom. Is that right? If so, I assume that the "Feature_Ind" group is the equivalent of that, where the features are mapped onto the grid.

If all this is true, then I believe the current code can and does actually create the atomic density features. To test it quickly, I created an atom-level grid (adjusting the integration test was the quickest/easiest way to do it) with all the features. I believe that the "atom_type_000" (etc.) features of this file are equivalent to the "AtomicDensities" features from the original deeprank hdf5 file, except that they are one-hot encoded instead of named by atom type.

Can you confirm this please, @cbaakman?

cbaakman commented 1 year ago

I made a script to compare the data:

import h5py
import numpy as np
from typing import Tuple

def _inflate(index: np.ndarray, value: np.ndarray, shape: Tuple[int, int, int]) -> np.ndarray:
    """Expand sparse (index, value) data into a dense 3D grid of the given shape."""

    data = np.zeros(shape[0] * shape[1] * shape[2])
    data[index] = value[:, 0]

    return data.reshape(shape)

def test_compare():

    # Old deeprank (CNN) file: carbon densities are stored sparsely, once per chain.
    with h5py.File("deeprank_CNN.hdf5", 'r') as f5:

        chain1_c_group = f5["1AK4_cm-it0_745/mapped_features/AtomicDensities_ind/C_chain1"]
        chain1_c_index = chain1_c_group["index"][:]
        chain1_c_value = chain1_c_group["value"][:]

        old_chain1_c_data = _inflate(chain1_c_index, chain1_c_value, (30, 30, 30))

        chain2_c_group = f5["1AK4_cm-it0_745/mapped_features/AtomicDensities_ind/C_chain2"]
        chain2_c_index = chain2_c_group["index"][:]
        chain2_c_value = chain2_c_group["value"][:]

        old_chain2_c_data = _inflate(chain2_c_index, chain2_c_value, (30, 30, 30))

    # deeprank-core file: one dense grid per one-hot encoded atom type.
    with h5py.File("grid_atomic.hdf5", 'r') as f5:
        new_c_data = f5["atom-ppi-1ATN_1w:A-B/mapped_features/atom_type_000/value"][:]

    # The summed per-chain densities should match the new grid within tolerance.
    assert np.all(np.abs((old_chain1_c_data + old_chain2_c_data) - new_c_data) < 0.001), "not the same"

However, it turns out that the two datasets do not have the same grid box settings, so they cannot be compared directly. I do think that your theory is correct. Where is your code located?

DaniBodor commented 1 year ago

> Where is your code located?

You mean the code to create the hdf5 file? Honestly, I did a quick & dirty adjustment of the test_integration_cnn() function and didn't save it. I just changed the prefix to a local folder so that it wouldn't get deleted by rmtree, and changed the query to atomic instead of residue.

For the hdf5 file from deeprank (not core), I followed the readme instructions.

LilySnow commented 1 year ago

Not sure whether we need atomic density in fact, because when we map the one-hot encoded feature for each atom to grid, we already used Gaussian. I think the atomic density used in DeepRank-CNN is defined here: Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., & Koes, D. R. (2017). Protein–ligand scoring with convolutional neural networks. Journal of chemical information and modeling, 57(4), 942-957.

DaniBodor commented 1 year ago

> Not sure whether we need atomic density in fact, because when we map the one-hot encoded feature for each atom to grid, we already used Gaussian.

Do you mean we don't need it as a separate feature from the current "atom_type" features, or do you mean that we don't need either?

I agree with the first, because they are the same thing. As for the second, I don't see why we wouldn't want the atom type feature.

I also don't understand what you mean by "we already used Gaussian". Is that the way to map the features onto the grid?
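For reference, mapping a per-atom value onto a grid with a Gaussian could look roughly like the sketch below. This is an illustration only, assuming an isotropic Gaussian with an arbitrary width; the function name and parameters are hypothetical, and the kernel actually used by deeprank or deeprank-core may differ.

```python
import numpy as np

def map_point_to_grid(center, value, grid_axes, sigma=1.0):
    """Spread a value located at center over a 3D grid with an isotropic Gaussian."""
    xs, ys, zs = np.meshgrid(*grid_axes, indexing="ij")
    sq_dist = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 + (zs - center[2]) ** 2
    return value * np.exp(-sq_dist / (2.0 * sigma ** 2))

# A 5x5x5 grid from -2 to 2 along each axis (arbitrary units).
axis = np.linspace(-2.0, 2.0, 5)
axes = (axis, axis, axis)

# One channel of a one-hot atom type: every atom of that type contributes 1.0,
# and the contributions of all atoms are summed on the grid.
atom_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
channel = sum(map_point_to_grid(pos, 1.0, axes) for pos in atom_positions)
single = map_point_to_grid((0.0, 0.0, 0.0), 1.0, axes)
```

Under these assumptions, a one-hot atom type mapped this way already behaves like a smooth per-type density, which may be why a separate atomic density feature looks redundant.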

DeepRank / deeprank2

Compare deeprank-core and deeprank features #279