UT-CHG / BET

Python package for data-consistent stochastic inverse and forward problems.
http://ut-chg.github.io/BET

proposed workflow for sampling #327

Closed mathematicalmichael closed 4 years ago

mathematicalmichael commented 5 years ago
import numpy as np
import bet.sample as sample
import bet.sampling.basicSampling as bsam

input_sample_set = sample.sample_set(input_dim)
input_sample_set.set_initial()  # uniform in all dims on the unit domain
Q = np.random.random([output_dim, input_dim])  # random linear QoI map
def Q_map(data, operator=Q):
    return (operator @ data.T).T
sampler = bsam.sampler(Q_map, num_samples)

The following should create a setup identical to using bsam.random_sample_set('r', input_dim, num_samples) to make the input set. The sampler checks for the presence of a scipy.stats rv_frozen/rv_continuous object attached to the sample set (the initial distribution) and, if present, uses it to generate samples according to that distribution, keeping volumes undefined.

my_disc = sampler.compute_QoI_and_create_discretization(input_sample_set)
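A minimal sketch of that check, where the attribute _initial_distribution and the helper name are illustrative, not actual BET API:

import numpy as np
from scipy.stats.distributions import rv_frozen

def _generate_from_initial(sample_set, num_samples):
    # Hypothetical helper: draw values from the attached initial
    # distribution if one exists, leaving volumes undefined; otherwise
    # fall back to uniform samples on the unit domain.
    dist = getattr(sample_set, '_initial_distribution', None)
    if isinstance(dist, rv_frozen):
        vals = dist.rvs(size=(num_samples, sample_set.get_dim()))
    else:
        vals = np.random.random((num_samples, sample_set.get_dim()))
    sample_set.set_values(vals)  # volumes intentionally left unset
    return sample_set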

The sampler class currently does no inference on Q_ref based on input_sample_set._reference_parameter, which feels like a missed opportunity to set default behaviors for simpleFunP, which could read Q_ref from the passed discretization instead of requiring it as an argument. For compatibility with the sampling approach (and particularly the data-driven one), I want simpleFunP.normal_partition_normal_distribution and the uniform_partition variety to be usable for both approaches, the goal being an easy toggle between the sampling and Voronoi approaches.
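A sketch of that default, assuming the reference value lives on the output sample set (attribute and helper names illustrative):

def _default_Q_ref(my_disc, Q_ref=None):
    # Hypothetical helper: prefer an explicit Q_ref, otherwise fall back
    # to the reference value carried by the discretization.
    if Q_ref is None:
        Q_ref = my_disc._output_sample_set._reference_value
    if Q_ref is None:
        raise ValueError("no Q_ref passed and none stored on the discretization")
    return Q_ref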

mathematicalmichael commented 5 years ago

more thoughts:

input_sample_set = bsam.random_sample_set('r', input_dim, num_samples)
sampler = bsam.sampler(Q_map, num_samples)
my_disc = sampler.compute_QoI_and_create_discretization(input_sample_set)

at this point, all that's missing is the specification of an output_probability_set

simpleFunP.name_of_dist(my_disc)  # should have the same effect as 
my_disc.set_observed()  # with default settings

if simpleFunP is used, make sure it also correctly sets the observed distribution.

Now, we are ready to solve:

calcP.prob(my_disc)
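For reference, under the standard Monte Carlo assumption (equal cell volumes), the Voronoi solve amounts to splitting each output cell's probability among the input samples whose images land in it. A simplified sketch of that idea, not the actual calculateP internals:

import numpy as np

def prob_sketch(output_cell_probs, io_ptr):
    # output_cell_probs: probability of each output-partition cell
    # io_ptr[i]: index of the cell containing the i-th output value
    counts = np.bincount(io_ptr, minlength=len(output_cell_probs))
    return output_cell_probs[io_ptr] / counts[io_ptr]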

want to be able to replace the first line of the snippet above (the random_sample_set call) with

input_sample_set = sample.sample_set(dim)
input_sample_set.set_initial()  # can specify any initial/prior distribution you want
input_sample_set.generate_values()  # if values are blank, the sampler calls this and fills them in

This allows skipping the construction of the sample set through basicSampling and lets the sampler object do that work intelligently.
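A sketch of that deferral inside the sampler (helper name hypothetical; generate_values is the proposed method above):

def _ensure_values(input_sample_set, num_samples):
    # Hypothetical helper the sampler could call before evaluating the
    # model: if the set carries an initial distribution but no values
    # yet, generate them now.
    if input_sample_set.get_values() is None:
        input_sample_set.generate_values(num_samples)
    return input_sample_set.get_values()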

here is one possible workflow:

# generate a normal initial and infer the dimension and domain from the keyword args
input_sample_set = bsam.random_sample_set('n', loc=0, scale=1)
sampler = bsam.sampler(Q_map, num_samples)
my_disc = sampler.compute_QoI_and_create_discretization(input_sample_set)
my_disc.set_observed()   # optional?
my_disc.set_predicted()  # compute the pushforward; should be optional; can pass an
                         # analytical expression, else default to a gaussian kde

Now you should be able to call my_disc.updated_pdf to evaluate the updated density at a new sample. To evaluate the density at the existing sample points, you can call the method my_disc.set_probabilities(), which multiplies the density evaluations by the volumes to set input_sample_set._probabilities.
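For reference, the density being evaluated is the ratio from the data-consistent framework: updated(lam) = initial(lam) * observed(Q(lam)) / predicted(Q(lam)). A minimal sketch with scipy, assuming independent marginals and the gaussian-kde pushforward mentioned above:

import numpy as np
from scipy.stats import gaussian_kde

def updated_pdf(samples, initial, observed, predicted, Q_map):
    # samples: (N, input_dim); initial/observed: frozen scipy.stats
    # distributions; predicted: gaussian_kde fit to Q_map applied to
    # the initial samples.
    q = Q_map(samples)
    ratio = observed.pdf(q).prod(axis=1) / predicted(q.T)
    return initial.pdf(samples).prod(axis=1) * ratio

set_probabilities() would then multiply these evaluations by the cell volumes to populate input_sample_set._probabilities.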

The thinking is that you should see consistency between the solutions of the two methods. If you set the problem up with the sample-based approach, calling calculateP.den should call my_disc.set_probabilities() and give you something to compare against the output of calculateP.prob.

Thinking about something similar for the data-driven approach, where the observed can be taken as the default N(0,1). We would need to build in something like my_disc._noise_model, which is also a distribution. compute_QoI_and_create_discretization could accept an integer to imply input_dim and a uniform random hypercube, the same way the other methods in this sub-module do when an integer is passed instead of a sample set. That would allow for something like:

sampler = bsam.sampler(Q_map, num_samples)
my_disc = sampler.compute_QoI_and_create_discretization(input_sample_set) 
my_disc.set_noise_model(dist.norm, scale=0.1)  # should be smart enough to infer the output dim;
                                               # if given a number, use it to set `data_std` for the data-driven map approach
# my_disc.noisy_data(std=1)  # does your model already perturb output_values, or do we need to?
#                            # if called, pass the `data_std` value; the idea is to let you skip setting the noise model
# If calling data-driven, by default we will perturb the data values. These noise realizations are stored
# inside the output_probability_set, along with the noise distribution. If we already have noisy data
# values, set the distribution but leave the values blank.

(If no noise model is given, do not perturb the data and assume unit variance when using the data-driven approach; if a noise model is provided, assume Q_ref is "pure" and perturb it with noise.) set_noise_model will also set an attribute data_std: grab .std() from the provided distribution; if that is missing, try the square root of .var(); otherwise print a warning about not being able to infer it and set it to 1.
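A sketch of that fallback chain (helper name hypothetical):

import warnings
import numpy as np

def _infer_data_std(noise_dist):
    # Prefer .std(), fall back to sqrt(.var()), else warn and default
    # to 1, as described above.
    try:
        return noise_dist.std()
    except AttributeError:
        pass
    try:
        return np.sqrt(noise_dist.var())
    except AttributeError:
        warnings.warn("could not infer data_std from the noise model; using 1.0")
        return 1.0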

set_noise_model should actually just set the output probability set, since that interpretation makes sense. You can use it to set the observed if you aren't in a data-driven approach. The values for this sample set will be the RESIDUALS (so the shape still matches the output dim). When "evaluating" the data-driven map, perform the sum and division on the fly. Yes, that's a little less efficient, but it's a quick operation, no worse than what the Voronoi-cell approach in calculateP.prob requires. This should allow for toggling between modes, so you can compare SSE/MSE/SWE QoI maps and their solutions.

Now that the std is known, we can generate our dataset of residuals: output_vals - data, where data is Q_ref + noise.rvs(output_dim).
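A sketch of that residual construction, assuming a frozen scipy.stats noise model (names illustrative):

import numpy as np
from scipy.stats import norm

def make_residuals(output_vals, Q_ref, noise=norm(loc=0, scale=0.1)):
    # output_vals: (N, output_dim) model outputs; Q_ref: (output_dim,)
    # "pure" reference output. Synthesize noisy data once, then store
    # residuals so the shape still matches the output dim.
    data = Q_ref + noise.rvs(size=output_vals.shape[1])
    return output_vals - data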

mathematicalmichael commented 5 years ago

Here's what I'm coming around to for studying sensitivity to the data:

import matplotlib.pyplot as plt

my_disc_new.set_initial()
my_disc_new.set_observed(scale=0.1)  # first time through, it will generate synthetic data
my_disc_new.set_data_from_reference()  # draws from the existing observed to create new synthetic data
my_disc_new.set_observed(scale=0.2)  # imposes this as the new observed distribution center
# I here is the input sample set; color its samples by the updated density
plt.scatter(I._values[:, 0], I._values[:, 1], c=my_disc_new.updated_pdf(I._values))
plt.show()

all the updates are performed by re-setting the observed; the pushforward is only computed once, on first start-up.

this is for the non-data-driven mode.
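A sketch of that caching behavior, assuming scipy's gaussian_kde for the pushforward (attribute names illustrative):

from scipy.stats import norm, gaussian_kde

def set_observed(disc, loc=0.0, scale=1.0):
    # Re-setting the observed just swaps the distribution; the KDE
    # pushforward is fit on the first call and then reused.
    disc._observed = norm(loc=loc, scale=scale)
    if getattr(disc, '_predicted', None) is None:  # first call only
        disc._predicted = gaussian_kde(disc._output_sample_set._values.T)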

mathematicalmichael commented 5 years ago

Handling repeated observations of the same QoI as data come in is now extremely memory-efficient: it never stores vectors it doesn't need.

my_disc_new.data_driven(inds=[0,0,0,0], std=0.4, data=np.array([0.8,1.2,1.3, 1.4])) 

my_disc_new.get_setup()

{0: {'col': True,
  'inds': [0, 0, 0, 0],
  'qoi': 'SWE',
  'std': 0.4,
  'obs': None,
  'pre': <scipy.stats.kde.gaussian_kde at 0x7fe6e4925048>,
  'model': <function __main__.one_map(data, operator=array([1, 1]))>}}

my_disc_new.get_data()

array([0.8, 1.2, 1.3, 1.4])

Omit data to simulate bootstrapping! Change error assumptions as you wish; the updated density will immediately reflect the changes.

You can pass a vector of stds if the data becoming available is of varying quality:

my_disc_new.data_driven(inds=[0,0,0], std=[0.1, 0.2, 0.3]) 
my_disc_new.get_setup()

{0: {'col': True,
  'inds': [0, 0, 0],
  'qoi': 'SWE',
  'std': array([0.1, 0.2, 0.3]),
  'obs': None,
  'pre': <scipy.stats.kde.gaussian_kde at 0x7fe6e40cc240>,
  'model': <function __main__.one_map(data, operator=array([1, 1]))>}}

my_disc_new.get_data()

array([0.8, 0.8, 0.8])
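Assuming "SWE" here denotes a sum-of-weighted-errors style map, the per-datum stds would enter as elementwise weights on the residuals, e.g.:

import numpy as np

def swe_map(output_vals, inds, data, std):
    # output_vals: (N, output_dim); inds repeats a QoI component once
    # per observation; std may be a scalar or a per-datum vector.
    residuals = (output_vals[:, inds] - data) / np.asarray(std)
    return residuals.sum(axis=1) / np.sqrt(len(data))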

Had we previously cleared the existing data, it would attempt to use a reference value and hit it with noise to simulate data. So you can pass data=None (the default) to do this, or simply:

my_disc_new._output_probability_set._reference_value = None

my_disc_new.data_driven(inds=[0,0,0])

my_disc_new.get_data()

array([1.87604622, 2.00842052, 1.84795764])

mathematicalmichael commented 5 years ago

NOTE: this is all just here for posterity. Actively re-factoring the code now. I decided to chop out a lot of "clever" functionality in favor of clearer functions. I really want something slick for switching between the modes, but it may not be worth chasing.

If someone sets an observed and then switches modes, I'm trying to decide whether or not to use its median/mean as the "data", inferring the std from the already-defined observed.

mathematicalmichael commented 4 years ago

welp, none of this is relevant anymore. workflow's been refactored.