Closed lukashergt closed 3 months ago
It should store only every Nth sample point. Depending on N (`= current_point.output_thin`), this may not make the files smaller, e.g. when N is smaller than the mean number of samples at each point, i.e. when the `oversampling_factors` are small. (In your example, you probably don't have a significant parameter speed hierarchy.)
Hmm, ok, I'm clearly not quite getting how the oversampling is implemented...
Asked differently: for `oversample_thin: true` I would have expected all `x1`, `x2` value pairs in the above example to be unique; however, here are the first few lines of the corresponding output file:
```
(py3121env) [~/cobayatest_oversample_thin]$ head -n 15 cobayatest_oversample_thin_true.1.txt
# weight minuslogpost x1 x2 x3 minuslogprior minuslogprior__0 chi2 chi2__gauss3d
1 30.07359 5.3152506 -2.2039542 2.5719266 7.4547199 7.4547199 45.237741 45.237741
2 26.993948 5.3152506 -2.2039542 0.67492268 7.4547199 7.4547199 39.078455 39.078455
1 27.309466 5.4104726 -2.1135579 0.67492268 7.4547199 7.4547199 39.709493 39.709493
1 26.51419 5.5510187 -1.1558462 0.67492268 7.4547199 7.4547199 38.118941 38.118941
1 27.118065 5.5510187 -1.1558462 1.2896787 7.4547199 7.4547199 39.326691 39.326691
1 19.546187 4.0174204 -0.93078769 1.2896787 7.4547199 7.4547199 24.182934 24.182934
1 20.723015 4.0174204 -0.93078769 2.004227 7.4547199 7.4547199 26.536589 26.536589
1 21.337675 4.0174204 -0.93078769 2.2904687 7.4547199 7.4547199 27.76591 27.76591
1 19.649555 4.0174204 -0.93078769 -1.3674819 7.4547199 7.4547199 24.38967 24.38967
3 14.08358 1.6888194 -1.7383815 -1.3674819 7.4547199 7.4547199 13.257719 13.257719
1 14.143138 1.6888194 -1.7383815 -1.4103633 7.4547199 7.4547199 13.376837 13.376837
1 13.775281 1.6888194 -1.7383815 -1.1195576 7.4547199 7.4547199 12.641122 12.641122
5 11.93854 1.3840606 0.53383186 -1.1195576 7.4547199 7.4547199 8.9676407 8.9676407
2 12.378051 1.7479903 -0.15540818 -1.1195576 7.4547199 7.4547199 9.8466623 9.8466623
```
In many lines `x1` and `x2` stay at the same value. For `oversample_thin: false` this is expected, but for `oversample_thin: true` I would have expected these repetitions to be thrown away. Am I misunderstanding how the thinning works?
> (in your example, you probably don't have a significant parameter speed hierarchy)
Well, I did manually block it 2 to 1, for which I get the same output for `false` and `true`.

I tried again, this time changing the blocks to 5 to 1, and this did lead to longer outputs for `false` compared to `true` (about 9700 lines and 16000 summed weights for `false`, and 6700 lines and 7900 summed weights for `true`), still far from a 5:1 ratio, though, and still with repeating `x1` and `x2` values for `true`.
You need to thin by more than the mean weight to get a significant reduction in the number of rows.
If the unthinned data has weights 2, 4, 2, 2, 2, ..., then thinning by 2 would give the output you quote. Imagine expanding the unthinned data to many more rows where all rows have weight 1 (and some are duplicates), then take every other row.
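That expand-then-subsample picture can be sketched in a few lines (a standalone illustration of the mental model; `thin_weighted` is a hypothetical helper, not Cobaya's actual implementation, and integer weights are assumed):

```python
import numpy as np

def thin_weighted(weights, thin):
    """Expand each row into `weight` unit-weight copies, keep every
    `thin`-th copy, then re-compress into (row index, new weight)."""
    weights = np.asarray(weights, dtype=int)
    # which original row each unit-weight copy belongs to
    expanded = np.repeat(np.arange(len(weights)), weights)
    kept = expanded[::thin]  # take every thin-th unit-weight copy
    rows, new_weights = np.unique(kept, return_counts=True)
    return rows, new_weights

# Weights 2, 4, 2, 2, 2 thinned by 2: every original row survives,
# just with reduced weights, so repeated slow-parameter values remain.
rows, w = thin_weighted([2, 4, 2, 2, 2], 2)
print(rows.tolist(), w.tolist())  # [0, 1, 2, 3, 4] [1, 2, 1, 1, 1]
```

With a thinning factor at or below the typical weight, no rows disappear and only the weights shrink; rows only start to drop out once the factor exceeds the mean weight.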
> Imagine expanding unthinned data to many more rows where all rows have weight 1 (and some are duplicates), then take every other row.
Ok, that is actually exactly what I was imagining... which means I don't understand what is going on in the original example. The output is the exact same for `oversample_thin: true` and `oversample_thin: false`. Do you get that, too?
Are you saying the thinning factor is larger than 1 but the mean weights of the output are actually identical? If you thin by 2, the output weights are only unity if the original weight was 1, 2 or 3 (depending on the thinning phase). Thinning works on all lines, not just on the subset of slowest parameters.
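The dependence on the original weight and on the thinning phase can be checked directly (a hypothetical illustration, not Cobaya code): a row's new weight is the number of kept positions in the expanded chain that fall inside its run of unit-weight copies.

```python
def surviving_weight(weight, thin, phase):
    """New weight of a row after thinning: count the kept positions
    (every `thin`-th copy in the expanded chain) that land inside this
    row's run of `weight` unit-weight copies starting at offset `phase`."""
    first = (-phase) % thin  # offset of the first kept copy within the run
    return len(range(first, weight, thin))

# Thinning by 2: weights 1-3 can come out as 1 (or drop entirely),
# while a weight-4 row always keeps weight 2, whatever the phase.
for w in (1, 2, 3, 4):
    print(w, [surviving_weight(w, 2, phase) for phase in (0, 1)])
```

This reproduces the statement above: for a thinning factor of 2, only original weights of 1, 2 or 3 can yield an output weight of unity, and which of them do depends on the thinning phase.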
Closing; please reopen if you have a reproducible example that is clearly a bug.
How exactly is the `oversample_thin` parameter supposed to work? The reason that I am asking is that in the following reduced example I am getting almost exactly the same output (note that I am setting `seed: 0`) regardless of whether I am choosing `oversample_thin: true` or `oversample_thin: false`. I would have expected `oversample_thin: true` to lead to reduced output files (as it turns out, `true` ends up producing larger output files because it continues sampling a bit longer)...