DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Compare deeprank-core and deeprank features #279

Closed gcroci2 closed 1 year ago

gcroci2 commented 1 year ago

Features that we currently have in deeprank-core, with the nomenclature in the hdf5 files.

Questions:

LilySnow commented 1 year ago

I did not find the feature list in the Nat. Comm. paper... The feature list was printed out in the log file. The log file was not backed up. So the easiest way I think is to run the example of DeepRank-CNN for "A . Generate the data set (using MPI)": https://github.com/DeepRank/deeprank. Make sure to set "compute_features='all' "

cbaakman commented 1 year ago

I remember that the Deeprank-CNN has the following features:

Possibly others

cbaakman commented 1 year ago

I recently noticed that the original deeprank stores features per chain. Examples:

@LilySnow Does this need to be done like this in deeprank-core as well?

DaniBodor commented 1 year ago

I did not find the feature list in the Nat. Comm. paper... The feature list was printed out in the log file. The log file was not backed up. So the easiest way I think is to run the example of DeepRank-CNN for "A . Generate the data set (using MPI)": https://github.com/DeepRank/deeprank. Make sure to set "compute_features='all' "

I tried running this script, but compute_features='all' is not a valid input. Instead, I passed every feature module name I found in the features folder, assuming that no feature-generating modules live anywhere else.

compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.BSA',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.ResidueDensity']

I uploaded the hdf5 file to this branch: https://github.com/DeepRank/deeprank-core/tree/279_compare_cnn_features_dbodor/tests/data/hdf5

The following features are present in the hdf5 file:

In addition, there is a separate group called "features_raw", which contains the same named features as the "features" group, but with different and differently shaped values.

There is also a group called "mapped_features", which has all of the above-mentioned features as well, but in duplicate, once per chain. In addition, "mapped_features" has an "AtomicDensities" group with 8 subgroups: C, N, O, and S for each chain.

Finally, note that I did not look at the content of the features, just at their presence/absence.
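For anyone who wants to repeat this presence/absence check, a minimal sketch with h5py is shown below. It walks a file and lists every group and dataset together with the dataset shapes. The helper name `list_hdf5_contents` and the entry and feature names in the demo file are made up for illustration; they are not the names used in the actual deeprank hdf5 file.

```python
import h5py
import numpy as np


def list_hdf5_contents(h5file):
    """Return a sorted list of (path, kind, shape) for every object in the file."""
    entries = []

    def _visit(name, obj):
        # visititems calls this once per group/dataset; returning None keeps walking.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, "dataset", obj.shape))
        else:
            entries.append((name, "group", None))

    h5file.visititems(_visit)
    return sorted(entries)


# Tiny in-memory file with hypothetical names, only to demonstrate the helper.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f5:
    f5.create_dataset("entry/features/feat_a", data=np.zeros((4, 3)))
    f5.create_dataset("entry/features_raw/feat_a", data=np.zeros((7,)))
    demo_entries = list_hdf5_contents(f5)

for path, kind, shape in demo_entries:
    print(path, kind, shape)
```

Pointing the same helper at a real file (opened with mode 'r') would show at a glance whether "features", "features_raw" and "mapped_features" hold the same names and how their shapes differ.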

DaniBodor commented 1 year ago

@cbaakman and @LilySnow, could you please take a look at my comment above and let me know:

cbaakman commented 1 year ago

I don't know the answers to most of these questions. But I always understood that features_raw was no longer used. Also, atomic densities seem to be a feature that should be available, but it cannot be mapped to the grid with the current deeprank-core code.

DaniBodor commented 1 year ago

Looking further at the RCD features in the ResidueDensity.py module of deeprank (CNN), it appears that this feature is just a count of the number of other residues within a certain cutoff radius; either the total number or the number of a certain polarity.

It shouldn't be too difficult to add a feature to deeprankcore to count these as well. I'll open an issue for that (#331).
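As a rough illustration of the counting described above, here is a minimal NumPy sketch. It assumes one representative coordinate per residue and a single polarity label per residue; the function name `count_neighbours`, the labels, and the cutoff value are hypothetical, and whether the original module measures distances this way or restricts the count to the other chain is not checked here.

```python
import numpy as np


def count_neighbours(coords, cutoff, labels=None, label=None):
    """Count, per residue, the other residues within `cutoff`.

    If `label` is given, only neighbours carrying that label are counted.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    within = dist <= cutoff
    np.fill_diagonal(within, False)  # a residue is not its own neighbour
    if label is not None:
        # Keep only columns (neighbours) whose label matches.
        within = within & (np.asarray(labels) == label)[None, :]
    return within.sum(axis=1)


# Hypothetical toy data: four residues, one of them far away from the rest.
coords = [[0, 0, 0], [3, 0, 0], [0, 4, 0], [20, 0, 0]]
labels = ["polar", "apolar", "polar", "polar"]

total = count_neighbours(coords, cutoff=4.5)
polar = count_neighbours(coords, cutoff=4.5, labels=labels, label="polar")
print(total.tolist())  # [2, 1, 1, 0]
print(polar.tolist())  # [1, 1, 1, 0]
```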

DaniBodor commented 1 year ago

Also, atomic densities seem to be a feature that should be available, but it can not be mapped to the grid with the current deeprank-core code.

If I understood @DarioMarzella correctly, these represent the proportion of the voxel that is occupied by each atom. Is that right? If so, I assume that the "Feature_Ind" group is the equivalent of that, where the features are mapped onto the grid.

If all this is true, then I believe the current code can and does actually create the atomic density features. To test it quickly, I created an atom-level grid (adjusting the integration test was the quickest/easiest way to do it) with all the features. I believe that the "atom_type_000" (etc.) features of this file are equivalent to the "AtomicDensity" features from the original deeprank hdf5 file, except that they are one-hot encoded instead of named by atom type.

Can you confirm this please, @cbaakman?

cbaakman commented 1 year ago

I made a script to compare the data:

import h5py
import numpy as np
from typing import Tuple

def _inflate(index: np.ndarray, value: np.ndarray, shape: Tuple[int, int, int]) -> np.ndarray:
    # Rebuild a dense grid from the sparse representation: flat voxel indices plus their values.
    data = np.zeros(shape[0] * shape[1] * shape[2])
    data[index] = value[:, 0]

    return data.reshape(shape)

def test_compare():

    # Original deeprank (CNN) file: carbon densities are stored sparsely, once per chain.
    with h5py.File("deeprank_CNN.hdf5", 'r') as f5:

        chain1_c_group = f5["1AK4_cm-it0_745/mapped_features/AtomicDensities_ind/C_chain1"]
        chain1_c_index = chain1_c_group["index"][:]
        chain1_c_value = chain1_c_group["value"][:]

        old_chain1_c_data = _inflate(chain1_c_index, chain1_c_value, (30, 30, 30))

        chain2_c_group = f5["1AK4_cm-it0_745/mapped_features/AtomicDensities_ind/C_chain2"]
        chain2_c_index = chain2_c_group["index"][:]
        chain2_c_value = chain2_c_group["value"][:]

        old_chain2_c_data = _inflate(chain2_c_index, chain2_c_value, (30, 30, 30))

    # deeprank-core file: the first atom type channel, read here as a dense grid.
    with h5py.File("grid_atomic.hdf5", 'r') as f5:
        new_c_data = f5["atom-ppi-1ATN_1w:A-B/mapped_features/atom_type_000/value"][:]

    # The two per-chain densities summed together should match the combined channel.
    assert np.all(np.abs((old_chain1_c_data + old_chain2_c_data) - new_c_data) < 0.001), "not the same"

However, it turns out that the two datasets do not have the same grid box settings, so a direct comparison is not possible. I do think that your theory is correct. Where is your code located?

DaniBodor commented 1 year ago

Where is your code located?

You mean to create the hdf5 file? Honestly, I did a quick & dirty adjustment of the test_integration_cnn() function and didn't save it. I just changed the prefix to a local folder, so that it wouldn't get deleted by rmtree, and changed the query to atomic instead of residue.

For the hdf5 file from deeprank (not core), I followed the readme instructions.

LilySnow commented 1 year ago

Not sure whether we need atomic density in fact, because when we map the one-hot encoded feature for each atom to grid, we already used Gaussian. I think the atomic density used in DeepRank-CNN is defined here: Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., & Koes, D. R. (2017). Protein–ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling, 57(4), 942-957.
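To make the Gaussian mapping idea concrete, here is a minimal NumPy sketch of spreading a single per-atom value over a 3D grid with a Gaussian centred on the atom. It is only an illustration, not the deeprank-core mapping code: the function name `map_gaussian`, the value of sigma, and the grid size and spacing are all assumptions.

```python
import numpy as np


def map_gaussian(position, value, axis_points, sigma=1.0):
    """Spread `value` located at `position` over a cubic grid with a Gaussian."""
    # Coordinates of every grid point, each array of shape (n, n, n).
    gx, gy, gz = np.meshgrid(axis_points, axis_points, axis_points, indexing="ij")
    sq_dist = (
        (gx - position[0]) ** 2
        + (gy - position[1]) ** 2
        + (gz - position[2]) ** 2
    )
    # Each voxel receives the value weighted by its distance to the atom.
    return value * np.exp(-sq_dist / (2.0 * sigma ** 2))


# Assumed toy grid: points at -2, -1, 0, 1, 2 along each axis.
axis_points = np.linspace(-2.0, 2.0, 5)
grid = map_gaussian(position=(0.0, 0.0, 0.0), value=1.0, axis_points=axis_points)

print(grid.shape)     # (5, 5, 5)
print(grid[2, 2, 2])  # 1.0 at the atom position, decaying with distance
```

With a one-hot atom type as the value, each channel of such a grid already behaves like a smooth density for that atom type, which is the overlap with a separate atomic density feature being discussed here.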

DaniBodor commented 1 year ago

Not sure whether we need atomic density in fact, because when we map the one-hot encoded feature for each atom to grid, we already used Gaussian.

Do you mean we don't need it as a separate feature from the current "atomtype" features, or do you mean that we don't need either?

I agree with the first, because they are the same thing. As for the second, I don't see why we wouldn't want the atom type feature.

I also don't understand what you mean by "we already used Gaussian". Is that the way to map the features onto the grid?