coarse-graining / cgnet

learning coarse-grained force fields
BSD 3-Clause "New" or "Revised" License

Get certain features #62

Closed brookehus closed 5 years ago

brookehus commented 5 years ago

Here is my suggestion for a better way of handling return_indices and get_prior_statistics with custom keys; I think it is cleaner and more efficient. My suggested course of action:

  1. Merge this into #61 if you agree with it.
  2. Close #60 (weird diff?) and proceed with #61 as your (@nec4's) PR - you can commit to it, and I need to PR to it.
  3. a. Prioritize fixing the tests that are failing because of the bond constant dictionary.
     b. Possibly add tests that look like your proposed version of return_indices? I think this would be helpful if you are willing.
  4. I review #60 and we merge it into #58.

What do you think?

nec4 commented 5 years ago

Sounds good. Reviewing now.

nec4 commented 5 years ago

I think I found some strange behavior for a 5-bead test molecule with full backbone stats. If I allow features to be a kwarg in get_prior_statistics() (see comment above):

prior_features = stats.get_prior_statistics(features=[(0,1),(3,4)])
for k,v in prior_features.items():
    print(k,v)
0 {'mean': tensor(1.9042), 'std': tensor(0.6031)}
3 {'mean': tensor(2.5001), 'std': tensor(0.6525)}

Looks good. Now changing the first tuple somehow changes the index of the second:

prior_features = stats.get_prior_statistics(features=[(0,2),(3,4)])
for k,v in prior_features.items():
    print(k,v)
3 {'mean': tensor(1.9042), 'std': tensor(0.6031)}
4 {'mean': tensor(2.5001), 'std': tensor(0.6525)}

With respect to the master list attribute, the sub-sequence [(0,2),(3,4)] is out of order. Should we guard against or correct this behavior if the user enters an unordered list of feature tuples?

Edit: it seems that there is something funky in return_indices. In this formulation, the method assumes the input features will be ordered in the same sense as the master list attribute.

print(stats.return_indices([(0,1),(3,4)]))
[0, 3]

print(stats.return_indices([(0,2),(3,4)]))
[3, 4]

For a general user this may be unsafe unless we state it somewhere in the docs or correct for it. I will take a closer look. What do you think?
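
For reference, here is a minimal sketch of an order-independent return_indices, assuming the master feature list is stored in an attribute such as self.master_description (a hypothetical name, not necessarily cgnet's actual attribute):

def return_indices(self, features):
    # Map each feature tuple to its position in the master list once,
    # so lookups do not depend on the order of the input features.
    # (self.master_description is a hypothetical attribute name.)
    index_map = {feat: i for i, feat in enumerate(self.master_description)}
    try:
        return [index_map[feat] for feat in features]
    except KeyError as e:
        raise ValueError("feature {} not found in master list".format(e))

With this version, return_indices([(0,2),(3,4)]) and return_indices([(3,4),(0,2)]) each resolve every tuple to its own master-list index, so the input ordering no longer matters.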

brookehus commented 5 years ago

I will look at this funky behavior too.

brookehus commented 5 years ago

Although note that the keys being integers was a bug - they're back to tuples now. But we could change to integers? That seems more confusing, for the reason you described above.

brookehus commented 5 years ago

Summarizing some of what I said on Slack:

nec4 commented 5 years ago

Got it - I will take another look and let you know if anything comes up. Barring that, I think I am fine with the changes - will give you the LGTM then.

nec4 commented 5 years ago

Removing the ability to not return a dictionary seems alright to me - we have been using dictionaries for priors up to this point, so it does not impact much. However, I still worry about the order of feature tuples in the inputs of get_prior_statistics() and, in turn, return_indices():

prior_features = stats.get_prior_statistics()
for k,v in prior_features.items():
    print(k,v)
print("")
test_features = stats.get_prior_statistics(features=[(2,3,4),(1,2),(1,2,3,4,'cos')])
for k,v in test_features.items():
    print(k,v)

(0, 1) {'mean': 2.012418908139269, 'std': 0.6173332929292423}
(1, 2) {'mean': 2.4216726369681227, 'std': 0.8213288285770245}
(2, 3) {'mean': 2.8745282143687696, 'std': 1.3714571997002478}
(3, 4) {'mean': 2.5493349113034016, 'std': 0.47363389711236953}
(0, 2) {'mean': 1.774948456473321, 'std': 0.29663486808813494}
(1, 3) {'mean': 2.4828062612693076, 'std': 1.2339970202914567}
(2, 4) {'mean': 2.933023516839941, 'std': 0.9274496856958656}
(0, 3) {'mean': 2.6964894482936677, 'std': 1.072305730254537}
(1, 4) {'mean': 2.018970301340096, 'std': 0.6946823039715447}
(0, 4) {'mean': 2.057845895845392, 'std': 0.7753098750332134}
(0, 1, 2) {'mean': 2.323987066954144, 'std': 0.34861435568647137}
(1, 2, 3) {'mean': 2.293738041847179, 'std': 0.44558315338102344}
(2, 3, 4) {'mean': 1.8407708108972223, 'std': 0.7627003658227768}
(0, 1, 2, 3, 'cos') {'mean': 0.19688260802053792, 'std': 0.5367364041760957}
(1, 2, 3, 4, 'cos') {'mean': 0.6946537881703886, 'std': 0.5081295092480996}
(0, 1, 2, 3, 'sin') {'mean': 0.16809996805613742, 'std': 0.8030527204553083}
(1, 2, 3, 4, 'sin') {'mean': -0.3380883081097161, 'std': 0.3807319428828161}

(2, 3, 4) {'mean': 2.4216726369681227, 'std': 0.8213288285770245}
(1, 2) {'mean': 1.8407708108972223, 'std': 0.7627003658227768}
(1, 2, 3, 4, 'cos') {'mean': 0.6946537881703886, 'std': 0.5081295092480996}

This output is incorrect, as the first two tuples' statistics are swapped. I think it's important to account for these cases - either by forcing the order of input features to these functions to obey the sense of self.order, or by generalizing so that the user gets back correct statistics in whatever weird order they may impose.
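
One way the generalization could look, sketched under the assumption that per-feature means and standard deviations are available as indexable sequences (the attribute names below are illustrative, not cgnet's actual API): pair each requested feature with its own looked-up statistics, so any input order yields a correct dictionary.

def get_prior_statistics(self, features=None):
    # Hypothetical sketch, not cgnet's actual implementation:
    # self.master_description, self._means, and self._stds are illustrative
    # names for the master feature list and its per-feature statistics.
    if features is None:
        features = list(self.master_description)
    indices = self.return_indices(features)  # order-independent lookup above
    return {feat: {'mean': self._means[i], 'std': self._stds[i]}
            for feat, i in zip(features, indices)}

Because each key is zipped with its own index, the statistics cannot be shuffled relative to the keys, and the caller's ordering is preserved in the returned dictionary.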

nec4 commented 5 years ago

LGTM!