coarse-graining / cgnet

learning coarse-grained force fields
BSD 3-Clause "New" or "Revised" License

Get certain features #62

Closed brookehus closed 5 years ago

brookehus commented 5 years ago

Here is my suggestion for a better way of handling return_indices and get_prior_statistics with custom keys; I think it is cleaner and more efficient. My suggested course of action:

  1. Merge this into #61 if you agree with it.
  2. Close #60 (weird diff?) and proceed with #61 as your (@nec4's) PR - you can commit to it, and I need to PR to it.
  3. a. Prioritize fixing the tests that are failing because of the bond constant dictionary.
     b. Possibly add tests that look like your proposed version of return_indices? I think this would be helpful if you are willing.
  4. I review #60 and we merge it into #58.

What do you think?

nec4 commented 5 years ago

Sounds good. Reviewing now.

nec4 commented 5 years ago

I think I found some strange behavior for a 5-bead test molecule with full backbone stats. If I allow features to be a kwarg in get_prior_statistics() (see comment above):

prior_features = stats.get_prior_statistics(features=[(0,1),(3,4)])
for k,v in prior_features.items():
    print(k,v)
0 {'mean': tensor(1.9042), 'std': tensor(0.6031)}
3 {'mean': tensor(2.5001), 'std': tensor(0.6525)}

Looks good. Now changing the first tuple somehow changes the index of the second:

prior_features = stats.get_prior_statistics(features=[(0,2),(3,4)])
for k,v in prior_features.items():
    print(k,v)
3 {'mean': tensor(1.9042), 'std': tensor(0.6031)}
4 {'mean': tensor(2.5001), 'std': tensor(0.6525)}

With respect to the master list attribute, the sub-sequence [(0,2),(3,4)] is out of order. Should we guard against or correct this behavior if the user enters an unordered list of feature tuples?

Edit: it seems that there is something funky in return_indices. In this formulation, the method assumes the input features will be ordered in the same sense as the master list attribute.

print(stats.return_indices([(0,1),(3,4)]))
[0, 3]

print(stats.return_indices([(0,2),(3,4)]))
[3, 4]

For a general user this may be unsafe unless we state it somewhere in the docs or correct for it. I will take a closer look. What do you think?
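
For reference, here is a minimal sketch of an order-independent return_indices, assuming the master feature list is stored in an attribute such as self.master_description (a hypothetical name, not necessarily cgnet's actual attribute):

def return_indices(self, features):
    # Map each feature tuple to its position in the master list once,
    # so lookups do not depend on the order of the input features.
    # (self.master_description is a hypothetical attribute name.)
    index_map = {feat: i for i, feat in enumerate(self.master_description)}
    try:
        return [index_map[feat] for feat in features]
    except KeyError as e:
        raise ValueError("feature {} not found in master list".format(e))

With this version, return_indices([(0,2),(3,4)]) and return_indices([(3,4),(0,2)]) each resolve every tuple to its own master-list index, so the input ordering no longer matters.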

brookehus commented 5 years ago

I will look at this funky behavior too.

brookehus commented 5 years ago

Although note that the keys being integers was a bug - they're back to tuples now. But we could change to integers? That seems more confusing, for the reason you described above.

brookehus commented 5 years ago

Summarizing some of what I said on Slack:

nec4 commented 5 years ago

Got it - I will take another look and let you know if anything comes up. Barring that, I think I am fine with the changes - will give you the LGTM then.

nec4 commented 5 years ago

Removing the ability to not return a dictionary seems alright to me - we have been using dictionaries for priors up to this point, so it does not impact much. However, I still worry about the order of feature tuples in the inputs of get_prior_statistics() and, in turn, return_indices():

prior_features = stats.get_prior_statistics()
for k,v in prior_features.items():
    print(k,v)
print("")
test_features = stats.get_prior_statistics(features=[(2,3,4),(1,2),(1,2,3,4,'cos')])
for k,v in test_features.items():
    print(k,v)

(0, 1) {'mean': 2.012418908139269, 'std': 0.6173332929292423}
(1, 2) {'mean': 2.4216726369681227, 'std': 0.8213288285770245}
(2, 3) {'mean': 2.8745282143687696, 'std': 1.3714571997002478}
(3, 4) {'mean': 2.5493349113034016, 'std': 0.47363389711236953}
(0, 2) {'mean': 1.774948456473321, 'std': 0.29663486808813494}
(1, 3) {'mean': 2.4828062612693076, 'std': 1.2339970202914567}
(2, 4) {'mean': 2.933023516839941, 'std': 0.9274496856958656}
(0, 3) {'mean': 2.6964894482936677, 'std': 1.072305730254537}
(1, 4) {'mean': 2.018970301340096, 'std': 0.6946823039715447}
(0, 4) {'mean': 2.057845895845392, 'std': 0.7753098750332134}
(0, 1, 2) {'mean': 2.323987066954144, 'std': 0.34861435568647137}
(1, 2, 3) {'mean': 2.293738041847179, 'std': 0.44558315338102344}
(2, 3, 4) {'mean': 1.8407708108972223, 'std': 0.7627003658227768}
(0, 1, 2, 3, 'cos') {'mean': 0.19688260802053792, 'std': 0.5367364041760957}
(1, 2, 3, 4, 'cos') {'mean': 0.6946537881703886, 'std': 0.5081295092480996}
(0, 1, 2, 3, 'sin') {'mean': 0.16809996805613742, 'std': 0.8030527204553083}
(1, 2, 3, 4, 'sin') {'mean': -0.3380883081097161, 'std': 0.3807319428828161}

(2, 3, 4) {'mean': 2.4216726369681227, 'std': 0.8213288285770245}
(1, 2) {'mean': 1.8407708108972223, 'std': 0.7627003658227768}
(1, 2, 3, 4, 'cos') {'mean': 0.6946537881703886, 'std': 0.5081295092480996}

This output is incorrect, as the first two tuples' statistics are swapped. I think it's important to account for these cases - either by forcing the order of input features to these functions to obey the sense of self.order, or by generalizing so that the user gets back correct statistics in whatever weird order they may impose.
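
One way the generalization could look, sketched under the assumption that per-feature means and standard deviations are available as indexable sequences (the attribute names below are illustrative, not cgnet's actual API): pair each requested feature with its own looked-up statistics, so any input order yields a correct dictionary.

def get_prior_statistics(self, features=None):
    # Hypothetical sketch, not cgnet's actual implementation:
    # self.master_description, self._means, and self._stds are illustrative
    # names for the master feature list and its per-feature statistics.
    if features is None:
        features = list(self.master_description)
    indices = self.return_indices(features)  # order-independent lookup above
    return {feat: {'mean': self._means[i], 'std': self._stds[i]}
            for feat, i in zip(features, indices)}

Because each key is zipped with its own index, the statistics cannot be shuffled relative to the keys, and the caller's ordering is preserved in the returned dictionary.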

nec4 commented 5 years ago

LGTM!