Closed brookehus closed 5 years ago
@coarse-graining/developers FYI this is going to be hairy - if you plan on making changes to ProteinBackboneStatistics
or ProteinBackboneFeature
(both of which may even be renamed) merge to this PR or just wait.
Sounds good. I'll take a look!
@coarse-graining/developers : ProteinBackboneStatistics
and ProteinBackboneFeature
are no longer going to be backbone only. What do you want to rename them? ProteinStatistics
and ProteinFeature
? Or something more general like GeometryStatistics
and GeometryFeature
?
GeometryFeature and GeometryStats sound great to me!
Ok. I've made a buuunch of changes. A key one is that I got rid of the bond constant dictionary. Now you just get a prior_statistics_dictionary
where the constant is automatically calculated for everything you specify. If you want it for only certain things, then you could make a new GeometryStatistics
object for only those things? Or you could post-process by only grabbing certain keys using return_indices
? So I'm imagining the pipeline is like,
GeometryStatistics
return_indices
to get indices for the priors you wantprior_statistics_array
according to those indices to get only e.g. bond and angle constantsSo this obviously breaks all the priors stuff. @nec4 can you give this a hard look and see if you think the new structure works with the priors?
A key thing that I haven't done yet is worked much with or tested bonds. What I'm doing now is there are two new keyword arguments to the GeometryStatistics
class: bond_inds
(list of 2-element tuples) and adjacent_backbone_bonds
(bool). You feed in stuff that is bonded which has to be contained in the features that are being calculated (trivial if you have get_all_distances=True
, otherwise must be contained in custom_features
). Then bonds are populated according to bond_inds
as well as pairs that are adjacent along the backbone if adjacent_backbone_bonds=True
(redundancies are removed). return_indices
draws from bond_inds
. This is untested right now.
Everything here looks really good. However, it would be nice to be able to quickly instantiate a GeometryFeature
layer off of a GeometryStatistics
layer with possibly custom features in the part of the pipeline that precedes model training. Since GeometryFeature.__init__()
takes an optional list of tuples argument feature_inds
, would it be useful for GeometryStatistics
to have some method that could return a simple list of all feature tuples (Currently there are just attributes for bonds/adjacent neighbors, etc)? The motivation for this comes from the idea that GeometryStatistics
is used for preprocessing and aids in construction of _PriorLayer
objects, and once that is done, the user can simply just grab all feature tuples and feed them directly to a GeometryFeature
layer in a single line.
Of course, one could also just do something like:
feature_tuples = []
for desc in GeometryStatistics.order:
feature_tuples += GeometryStatistics.descriptions[desc]
FeatureLayer = GeometryFeature(feature_inds=feature_tuples)
Bearing in mind that this approach also produces duplicate dihedral tuples (see the comment below), what do you think?
I think I addressed or followed up to all of your comments!
Can you update your review? @nec4 what do you think we have to do here besides "functional testing" via fixing the example nb?
I can't think of anything off the top of my head - perhaps something will come up as we go through the example notebook. I have been playing with the tools using random data in a little notebook for a while now and I am happy with how they work. Will update the review.
Ok, added a get_zscore_array
method (this commit) and test (this commit).
The test only addresses the correspondence of get_zscore_array
to get_prior_statistics
, not the independent functioning of get_zscore_array
.
[WIP] - Watch this space!
To do list (just for myself)
geometry
; possibly do them all the same way, or keep the faster scipy option for numpy only? Check whether dmat already does the work of redundant distances? Possibly move this entire thing togeometry
?Part II (#61):
Part III: