ag-csw / LDStreamHMMLearn


Determination of Effective Window Size in Bayes Calculation #15

Closed greenTara closed 7 years ago

greenTara commented 7 years ago

The formula currently used for the effective window size is not completely accurate. There is a small underestimate of the effective window size (for a given value of "r").

alexlafleur commented 7 years ago

Any ideas on that yet?

alexlafleur commented 7 years ago

We need a script that runs on a fixed dataset with a fixed set of parameters, where only window_size and r vary:

taumeta=4
shift=64
num_trajectories=2
len_trajectory=8^k+16*shift
num_estimations=16
num_states=4

window_size = range(128, 8^k > 4000), increasing by a factor of 2.

Take the latest error within each of the window_size runs, repeat the whole sweep 8 times with new samples, and average the errors.
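The sweep described above could be sketched roughly as follows. This is only an illustration: `estimate_error` is a hypothetical stand-in for the real estimation code (here it just returns synthetic noisy values), and the upper window bound of 4096 is one possible reading of "8^k > 4000".

```python
import random

def window_sizes(start=128, stop=4096):
    """Window sizes from start up to stop (inclusive), doubling each step."""
    sizes = []
    w = start
    while w <= stop:
        sizes.append(w)
        w *= 2
    return sizes

def estimate_error(window_size, rng):
    """Hypothetical placeholder returning the latest error of one run.

    Synthetic model only (error ~ 1/sqrt(window_size) plus noise); the
    real script would run the estimators on freshly sampled trajectories.
    """
    return (1.0 + 0.1 * rng.random()) / window_size ** 0.5

def average_latest_errors(num_runs=8, seed=0):
    """Average the latest error per window size over num_runs fresh samples."""
    rng = random.Random(seed)
    sizes = window_sizes()
    totals = {w: 0.0 for w in sizes}
    for _ in range(num_runs):
        for w in sizes:
            totals[w] += estimate_error(w, rng)
    return {w: totals[w] / num_runs for w in sizes}

avg_errors = average_latest_errors()
```

Keeping the run loop outermost means each repetition draws new samples for the whole sweep, which matches the "run this all together 8 times" wording.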

The output should be two point plots showing LOG values:

alexlafleur commented 7 years ago

[Image: effective_window_size_plot]

alexlafleur commented 7 years ago

[Image: effective_window_size_plot_numtraj]

alexlafleur commented 7 years ago

num_trajectories=8

[Image: effective_window_size_plot]

greenTara commented 7 years ago

The first task is to increase the number of runs (num_runs) so that the statistical fluctuations in these plots are smoothed out. Let's put num_trajectories back to 2; increasing it only reduces the error of the estimation, and that is not what we care about at this point.

Once we have enough runs so that we see a smooth dependency in these point plots, we need to see if there is an effect from num_estimations. I would like to perform the calculation over a range of num_estimations (1, 2, 4, 8, 16), and plot these as separate curves on the same graphs, with the axes the same as before.

alexlafleur commented 7 years ago

num_runs = 256, num_estimations = 8

[Image: effective_window_size_plot_256]

alexlafleur commented 7 years ago

[Image: effective_window_size_plot]

greenTara commented 7 years ago

The goal is to optimize the script so that a large number of runs can be made without unnecessary calculation. There are several points to address, some already completed:

  1. Since the naive algorithm should have the same expected error for all values k = 0 to num_estimations, we omitted these estimations except for the k=0 case, which is needed to set the prior in the Bayes estimations. DONE
  2. The input and output signatures of the methods now need to be fixed to accommodate this change.
     a) performance_and_error_calculation returns errbayes, not err. Also, err is omitted from the call sequence. DONE
     b) err is removed from the call sequence and return values where performance_and_error_calculation is called in get_errors. DONE
     c) avg_err_final is removed from the return sequence of get_errors, and also from the statement where get_errors is called. (TODO) Note that all statements involved in calculating avg_err_final can be deleted, but the place where this return value is used in the main test script needs to be updated: naive[time] = avg_err_final => naive = avg_err_bayes[0]. The loop at line 23 can be deleted altogether.
     d) avg_err_bayes_final is renamed to avg_err_bayes. It appears that all statements relevant to this return value are still correct.
     e) The self.times array is no longer needed. The plots will, I think, use all values k = 0, 1, ..., 16 for the Bayes case, and that is OK.

As far as I can tell, that is all that must be changed regarding the plots.

For checking the formula, add the following set of print statements, one set for each point on the graph:

  1. Window Size: window_size
  2. Number of Estimations: num_estimations
  3. Shift
  4. Actual Expected Bayes / Expected Naive Ratio: bayes[k]/bayes[0]
  5. Theoretical Bound for Expected Bayes / Expected Naive Ratio: the formula
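One way to produce the five print statements listed above is a small formatting helper. This is a minimal sketch: the function name `format_diagnostics` and its argument list are assumptions, not part of the existing script, and the theoretical value is passed in as a number because the formula itself is not fixed at this point in the thread.

```python
def format_diagnostics(window_size, num_estimations, shift, bayes, k,
                       theoretical_ratio):
    """Build the diagnostic lines for one point on the graph.

    bayes is the list of averaged Bayes errors indexed by k, so that
    bayes[0] doubles as the naive baseline (hypothetical layout).
    """
    actual_ratio = bayes[k] / bayes[0]
    lines = [
        f"Window Size: {window_size}",
        f"Number of Estimations: {num_estimations}",
        f"Shift: {shift}",
        f"Actual Expected Bayes / Expected Naive Ratio: {actual_ratio}",
        f"Theoretical Bound for Expected Bayes / Expected Naive Ratio: "
        f"{theoretical_ratio}",
    ]
    return "\n".join(lines)

# Illustrative values only, not measured results.
print(format_diagnostics(128, 16, 64, [0.2, 0.1], 1, 0.6))
```

Returning a string instead of printing directly makes it easy to redirect the same output to a file.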

Another optimization:

There are a few places where renaming variables will lead to code that is more self-documenting.

  1. When taking the log for plotting, don't reuse the same variable name, add "log" to it somewhere.
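A minimal sketch of the naming convention, assuming the errors are held in a plain list and that the natural log is wanted (the base is not stated above); the values are illustrative only:

```python
import math

# Averaged errors as returned by the estimation (illustrative values only).
avg_err_bayes = [0.08, 0.06, 0.05]

# Store the log values under a separate name instead of overwriting
# avg_err_bayes, so the raw errors stay available for ratio checks.
avg_err_bayes_log = [math.log(e) for e in avg_err_bayes]
```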

That should do it. The number of runs will need to be at least 256, and may need to be more to get stable values for the expectation of the error.

alexlafleur commented 7 years ago

Numruns = 1028

[Image: effective_window_size_plot]

temp.txt

greenTara commented 7 years ago

There have been no further comments on this issue for 15 days, but I believe there was some additional work on this script. Is there a related issue?

greenTara commented 7 years ago

I have an improved formula for the expectation of the error in the MM case:

err_bayes[k]/err_naive = math.sqrt((1 + math.pow(r, 2*k + 1)) / (1 + r))
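As a quick sanity check, the ratio formula above can be wrapped in a function. The name `bayes_to_naive_ratio` is an assumption for illustration, and the comments about monotonic behaviour assume 0 < r < 1.

```python
import math

def bayes_to_naive_ratio(r, k):
    """Expected Bayes/naive error ratio after k estimations (MM case)."""
    return math.sqrt((1 + math.pow(r, 2 * k + 1)) / (1 + r))

# At k = 0 the ratio is 1, i.e. the Bayes estimate equals the naive one.
ratio_at_start = bayes_to_naive_ratio(0.5, 0)

# For 0 < r < 1 the ratio decreases with k toward sqrt(1 / (1 + r)).
ratio_limit = math.sqrt(1 / (1 + 0.5))
```

The k = 0 case agrees with the earlier remark that the k=0 estimation serves as the naive baseline.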

Also

w = self.window_size
err_naive = c/math.sqrt(w * num_trajs)

where c is a function of eta, scale_window, taumeta, etc.
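The scaling of the naive error can be illustrated with a short sketch. Here c is simply passed in as a number, since its dependence on eta, scale_window, taumeta, etc. is not spelled out; the function name and the values are illustrative assumptions.

```python
import math

def err_naive(c, window_size, num_trajs):
    """Naive error estimate: c / sqrt(window_size * num_trajs)."""
    return c / math.sqrt(window_size * num_trajs)

# Quadrupling the window size halves the expected naive error
# when c and num_trajs are held fixed.
small_window = err_naive(1.0, 128, 2)
large_window = err_naive(1.0, 512, 2)
```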

greenTara commented 7 years ago

Discussion continuing on #31