Closed DTRademaker closed 2 years ago
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
why not unify the naming such that FEATURENAME_XXXX = "xxxx"
is always true? Wouldn't that create more clarity and less potential for mistakes?
We are evaluating which is the clearest way to organize feature names, both in the code and in the hdf5 file. As @DaniBodor mentioned above, one solution could be to align their code definition and hdf5 definition.
First we're going to solve issues #148 and #147.
My suggestion for the new feature names (you can also check them in this commit.
@cbaakman @gcroci2, can you give me some feedback on whether this is all correct and clear.
# node features
## generic features
FEATURE_NODE_MAINCHAIN = "main_chain" # FEATURENAME_CHAIN = "chain" : bool. --> I think unused. It is checked twice, but never assigned.
FEATURE_NODE_COORDINATES = "coordinates" # FEATURENAME_POSITION = "pos": list[3xfloat]
## residue core features
FEATURE_NODE_RESIDUE = "residue" # FEATURENAME_AMINOACID = "type": AminoAcid object
FEATURE_NODE_CHARGE = "charge" # FEATURENAME_CHARGE = "charge" : float(<0) --> currently unused, any particular reason?
FEATURE_NODE_POLARITY = "polarity" # FEATURENAME_POLARITY = "polarity" : Polarity object
FEATURE_NODE_RESIDUESIZE = "residue_size" # FEATURENAME_SIZE = "size" : int --> called it residue size in case we will also have an atom size in the future
FEATURE_NODE_HBONDDONORS = "hbond_donors" # FEATURENAME_HYDROGENBONDDONORS = "hb_donors" : int
FEATURE_NODE_HBONDACCEPTORS = "hbond_acceptors"# FEATURENAME_HYDROGENBONDACCEPTORS = "hb_acceptors" : int
## conservation features
FEATURE_NODE_PSSM = "pssm" # FEATURENAME_PSSM = "pssm": list[20xint]
FEATURE_NODE_INFORMATIONCONTENT = "information_content" # FEATURENAME_INFORMATIONCONTENT = "ic" : float
FEATURE_NODE_CONSERVATION = "conservation" # FEATURENAME_PSSMWILDTYPE = "pssm_wildtype" : int
## protein context features
FEATURE_NODE_BURIEDSURFACEAREA = "buried_surface_area" # FEATURENAME_BURIEDSURFACEAREA = "bsa": float
FEATURE_NODE_HALFSPHEREEXPOSURE = "half_sphere_exposure" # FEATURENAME_HALFSPHEREEXPOSURE = "hse": list[3xfloat]
FEATURE_NODE_RESIDUEDEPTH = "residue_depth" # FEATURENAME_RESIDUEDEPTH = "depth" : float
FEATURE_NODE_SASA = "sasa" # FEATURENAME_SASA = "sasa" : float
## variant features
FEATURE_NODE_VARIANTRESIDUE = "variant_residue" # FEATURENAME_VARIANTAMINOACID = "variant": AminoAcid object
FEATURE_NODE_DIFFERENCESIZE = "difference_size" # FEATURENAME_SIZEDIFFERENCE = "size_difference" : int
FEATURE_NODE_DIFFERENCEPOLARITY = "difference_polarity" # FEATURENAME_POLARITYDIFFERENCE = "polarity_difference"
FEATURE_NODE_DIFFERENCEHBONDDONORS = "difference_hbond_donors" # FEATURENAME_HYDROGENBONDDONORSDIFFERENCE = "hb_donors_difference"
FEATURE_NODE_DIFFERENCEHBONDACCEPTORS = "difference_hbond_acceptors" # FEATURENAME_HYDROGENBONDACCEPTORSDIFFERENCE = "hb_acceptors_difference"
FEATURE_NODE_DIFFERENCECHARGE = "difference_charge" # if we include CHARGE, it probably makes sense to include this one too
FEATURE_NODE_VARIANTCONSERVATION = "variant_conservation" # FEATURENAME_PSSMVARIANT = "pssm_variant" : int --> isn't this information basically covered by FEATURENAME_NODE_CONSERVATIONDIFFERENCE? If not, then we should maybe also include VARIANT versions for: SIZE, POLARITY, HBONDDONORS, HBONDACCEPTORS
FEATURE_NODE_DIFFERENCECONSERVATION = "difference_conservation" # FEATURENAME_PSSMDIFFERENCE = "pssm_difference" : int
# FEATURENAME_CONSERVATION = "conservation" # int; unused; i think superceded by FEATURENAME_PSSMVARIANT
# FEATURENAME_CONSERVATIONDIFFERENCE = "conservation_difference" # unused, i think superceded by: FEATURENAME_PSSMDIFFERENCE
# edge features
FEATURE_EDGE_COVALENT = "covalent" # FEATURENAME_COVALENT = "covalent" : bool
FEATURE_EDGE_ELECTROSTATICPOTENTIAL = "electrostatic_potential" # FEATURENAME_EDGECOULOMB = "coulomb"
FEATURE_EDGE_LJPOTENTIAL = "lj_potential" # FEATURENAME_EDGEVANDERWAALS = "vanderwaals"
FEATURE_EDGE_DISTANCE = "distance" # FEATURENAME_EDGEDISTANCE = "dist"
# FEATURE_EDGE_INTERACTIONTYPE = "interaction_type" # FEATURENAME_EDGETYPE = "type" --> replace by CIS/SAMECHAIN
FEATURE_EDGE_CIS = "cis" # FEATURENAME_EDGESAMECHAIN = "same_chain" # bool --> unused; I think superceded by INTERACTIONTYPE, but this parameter makes more sense to me
# ## edge types --> use EDGE_CIS instead?
# FEATURE_EDGE_CISINTERACTION = "cis_interaction" # EDGETYPE_INTERNAL = "internal"
# FEATURE_EDGE_TRANSINTERACTION = "trans_interaction"# EDGETYPE_INTERFACE = "interface"
Just highlighting/explaining a few questions from the comments in the code.
To me they look much better now! I have only minor comments/confirmations:
some features appear to be boolean but either set to 0/1 or 0.0/1.0 (e.g. COVALENT). Is there any reason these cannot be actual True/False-booleans instead?
Yes, I remember that storing a boolean in hdf5 file could be problematic, and using directly numerical values is handier to directly have the numerical feature ready for the neural network.
PSSMVARIANT: for all other variant features, only the difference to the wildtype is used and not the variant feature itself. Any particular reason why? Should we only do difference here as well or should we include the actual feature as well as the difference for all variant features?
No idea, I think @cbaakman can help to answer this.
deeprankcore/tools/visualization.py module seems outdated (e.g. it uses features such as FEATURENAME_CHAIN that are never created). Should we update it or remove it? As it is now, if I understood correctly, it's not usable. @cbaakman
The GNN features are sometimes residue-specific features and sometimes atom-specific. It would be nice if the feature name included something like 'atom' or 'aa'.
type
for the atom-based gnn is a residue feature. We could call itres_type
.