Make features' names clearer

DTRademaker commented 2 years ago

The GNN features are sometimes residue-specific features and sometimes atom-specific. It would be nice if the feature name included something like 'atom' or 'aa'.

For example, the feature type for the atom-based gnn is a residue feature. We could call it res_type.
Any other unclear features' names? (see deeprankcore.domain.feature.py)

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

DaniBodor commented 2 years ago

why not unify the naming such that FEATURENAME_XXXX = "xxxx" is always true? Wouldn't that create more clarity and less potential for mistakes?

gcroci2 commented 2 years ago

We are evaluating which is the clearest way to organize feature names, both in the code and in the hdf5 file. As @DaniBodor mentioned above, one solution could be to align their code definition and hdf5 definition.

First we're going to solve issues #148 and #147.

DaniBodor commented 2 years ago

My suggestion for the new feature names (you can also check them in this commit.

@cbaakman @gcroci2, can you give me some feedback on whether this is all correct and clear.

# node features

## generic features
FEATURE_NODE_MAINCHAIN = "main_chain" # FEATURENAME_CHAIN = "chain" : bool. --> I think unused. It is checked twice, but never assigned.
FEATURE_NODE_COORDINATES = "coordinates" # FEATURENAME_POSITION = "pos": list[3xfloat]

## residue core features
FEATURE_NODE_RESIDUE = "residue" # FEATURENAME_AMINOACID = "type": AminoAcid object     
FEATURE_NODE_CHARGE = "charge" # FEATURENAME_CHARGE = "charge" : float(<0) --> currently unused, any particular reason?
FEATURE_NODE_POLARITY = "polarity" # FEATURENAME_POLARITY = "polarity" : Polarity object
FEATURE_NODE_RESIDUESIZE = "residue_size" # FEATURENAME_SIZE = "size" : int --> called it residue size in case we will also have an atom size in the future
FEATURE_NODE_HBONDDONORS = "hbond_donors" # FEATURENAME_HYDROGENBONDDONORS = "hb_donors" : int
FEATURE_NODE_HBONDACCEPTORS = "hbond_acceptors"# FEATURENAME_HYDROGENBONDACCEPTORS = "hb_acceptors" : int

## conservation features
FEATURE_NODE_PSSM = "pssm" # FEATURENAME_PSSM = "pssm": list[20xint]
FEATURE_NODE_INFORMATIONCONTENT = "information_content" # FEATURENAME_INFORMATIONCONTENT = "ic" : float
FEATURE_NODE_CONSERVATION = "conservation" # FEATURENAME_PSSMWILDTYPE = "pssm_wildtype" : int

## protein context features
FEATURE_NODE_BURIEDSURFACEAREA = "buried_surface_area" # FEATURENAME_BURIEDSURFACEAREA = "bsa": float
FEATURE_NODE_HALFSPHEREEXPOSURE = "half_sphere_exposure" # FEATURENAME_HALFSPHEREEXPOSURE = "hse":  list[3xfloat] 
FEATURE_NODE_RESIDUEDEPTH = "residue_depth" # FEATURENAME_RESIDUEDEPTH = "depth" : float
FEATURE_NODE_SASA = "sasa" # FEATURENAME_SASA = "sasa" : float

## variant features
FEATURE_NODE_VARIANTRESIDUE = "variant_residue" # FEATURENAME_VARIANTAMINOACID = "variant": AminoAcid object
FEATURE_NODE_DIFFERENCESIZE = "difference_size" # FEATURENAME_SIZEDIFFERENCE = "size_difference" : int
FEATURE_NODE_DIFFERENCEPOLARITY = "difference_polarity" # FEATURENAME_POLARITYDIFFERENCE = "polarity_difference"
FEATURE_NODE_DIFFERENCEHBONDDONORS = "difference_hbond_donors" # FEATURENAME_HYDROGENBONDDONORSDIFFERENCE = "hb_donors_difference"
FEATURE_NODE_DIFFERENCEHBONDACCEPTORS = "difference_hbond_acceptors" # FEATURENAME_HYDROGENBONDACCEPTORSDIFFERENCE = "hb_acceptors_difference"
FEATURE_NODE_DIFFERENCECHARGE = "difference_charge" # if we include CHARGE, it probably makes sense to include this one too
FEATURE_NODE_VARIANTCONSERVATION = "variant_conservation" # FEATURENAME_PSSMVARIANT = "pssm_variant" : int  --> isn't this information basically covered by FEATURENAME_NODE_CONSERVATIONDIFFERENCE? If not, then we should maybe also include VARIANT versions for: SIZE, POLARITY, HBONDDONORS, HBONDACCEPTORS
FEATURE_NODE_DIFFERENCECONSERVATION = "difference_conservation" # FEATURENAME_PSSMDIFFERENCE = "pssm_difference" : int
# FEATURENAME_CONSERVATION = "conservation" # int; unused; i think superceded by FEATURENAME_PSSMVARIANT
# FEATURENAME_CONSERVATIONDIFFERENCE = "conservation_difference" # unused, i think superceded by: FEATURENAME_PSSMDIFFERENCE

# edge features
FEATURE_EDGE_COVALENT = "covalent" # FEATURENAME_COVALENT = "covalent" : bool
FEATURE_EDGE_ELECTROSTATICPOTENTIAL = "electrostatic_potential" # FEATURENAME_EDGECOULOMB = "coulomb"
FEATURE_EDGE_LJPOTENTIAL = "lj_potential" # FEATURENAME_EDGEVANDERWAALS = "vanderwaals"
FEATURE_EDGE_DISTANCE = "distance" # FEATURENAME_EDGEDISTANCE = "dist"
# FEATURE_EDGE_INTERACTIONTYPE = "interaction_type" # FEATURENAME_EDGETYPE = "type" --> replace by CIS/SAMECHAIN
FEATURE_EDGE_CIS = "cis" # FEATURENAME_EDGESAMECHAIN = "same_chain" # bool --> unused; I think superceded by INTERACTIONTYPE, but this parameter makes more sense to me

# ## edge types --> use EDGE_CIS instead?
# FEATURE_EDGE_CISINTERACTION = "cis_interaction" # EDGETYPE_INTERNAL = "internal"
# FEATURE_EDGE_TRANSINTERACTION = "trans_interaction"# EDGETYPE_INTERFACE = "interface"

DaniBodor commented 2 years ago

Just highlighting/explaining a few questions from the comments in the code.

CHARGE: this feature is not currently used; any particular reason for that? If we will start using it, does it also make sense to include a DIFFERENCECHARGE feature for variants?
PSSMVARIANT: for all other variant features, only the difference to the wildtype is used and not the variant feature itself. Any particular reason why? Should we only do difference here as well or should we include the actual feature as well as the difference for all variant features?
CONSERVATION/CONSERVATIONDIFFERENCE: Am I right that these are outdated versions of PSSMVARIANT/PSSMDIFFERENCE?
It looks like SAMECHAIN was replaced by EDGETYPE at some point. Given that EDGETYPE has a very different structure than the rest of the features, I suggest to replace both by something like CIS with a boolean input: either it is or isn't a cis-interaction (similar to COVALENT). Does this make sense?
some features appear to be boolean but either set to 0/1 or 0.0/1.0 (e.g. COVALENT). Is there any reason these cannot be actual True/False-booleans instead?

gcroci2 commented 2 years ago

To me they look much better now! I have only minor comments/confirmations:

FEATURENAME_CHAIN, FEATURENAME_CHARGE, FEATURENAME_CONSERVATION, FEATURENAME_CONSERVATIONDIFFERENCE, FEATURE_EDGE_CIS, FEATURE_EDGE_CISINTERACTION, FEATURE_EDGE_TRANSINTERACTION
- I would remove them
FEATURE_NODE_RESIDUE = "residue"
- I'd call it FEATURE_NODE_RESTYPE = "res_type"
FEATURE_EDGE_ELECTROSTATICPOTENTIAL = "electrostatic_potential"
- I'd use a shorter nomenclature

some features appear to be boolean but either set to 0/1 or 0.0/1.0 (e.g. COVALENT). Is there any reason these cannot be actual True/False-booleans instead?

Yes, I remember that storing a boolean in hdf5 file could be problematic, and using directly numerical values is handier to directly have the numerical feature ready for the neural network.

PSSMVARIANT: for all other variant features, only the difference to the wildtype is used and not the variant feature itself. Any particular reason why? Should we only do difference here as well or should we include the actual feature as well as the difference for all variant features?

No idea, I think @cbaakman can help to answer this.

deeprankcore/tools/visualization.py module seems outdated (e.g. it uses features such as FEATURENAME_CHAIN that are never created). Should we update it or remove it? As it is now, if I understood correctly, it's not usable. @cbaakman

DeepRank / deeprank2

Make features' names clearer #145