bancaditalia / black-it

Black-box abm calibration kit by the Bank of Italy
https://bancaditalia.github.io/black-it/
GNU Affero General Public License v3.0
43 stars 2 forks source link

[WIP] Update `BaseSampler` interface #4

Closed marcofavoritobi closed 2 years ago

marcofavoritobi commented 2 years ago

Proposed changes

Remove single_sample as abstract method, and instead make sample_batch the core abstract method to implement.

Fixes

n/a

Types of changes

What types of changes does your code introduce? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply.

Further comments

n/a

marcofavoritobi commented 2 years ago

~TODO: after all samplers have been updated according to the new BaseSampler class, update seed getter/setter with Optional[int]~ done

marcofavoritobi commented 2 years ago

Halton sampler "visual proof" of functional equivalence (i.e. low-discrepancy).

Script to reproduce the plots:

import time
import numpy as np
from black_it.samplers.halton import HaltonSampler
from black_it.search_space import SearchSpace

import matplotlib.pyplot as plt

from black_it.samplers.r_sequence import RSequenceSampler
from black_it.search_space import SearchSpace

N = 10000
sampler = HaltonSampler(batch_size=8, random_state=0)
param_grid = SearchSpace(
    parameters_bounds=np.array([[0, 1], [0, 1]]).T,
    parameters_precision=np.array([0.00001, 0.00001]),
    verbose=False,
)
new_params = sampler.sample_batch(N, param_grid, np.zeros((0, 2)), np.zeros((0, 2)))

plt.scatter(*zip(*new_params), s=0.1)
plt.show()
AldoGl commented 2 years ago

Ok great, the two figures do indeed look alike, while before you fixed that bug the new version was showing some regular stripes patterns

Edit: forget it, I mistook the figures as R-samples, while they refer to the Halton sampler which was working even before

marcofavoritobi commented 2 years ago

Before 44109a1:

image

After 44109a1:

image

Looks wrong...

Code snippet:

import time
import numpy as np
from black_it.samplers.halton import HaltonSampler
from black_it.search_space import SearchSpace

import matplotlib.pyplot as plt

from black_it.samplers.r_sequence import RSequenceSampler
from black_it.search_space import SearchSpace

N = 10000
sampler = RSequenceSampler(batch_size=8, random_state=0)
param_grid = SearchSpace(
    parameters_bounds=np.array([[0, 1], [0, 1]]).T,
    parameters_precision=np.array([0.00001, 0.00001]),
    verbose=False,
)
new_params = sampler.sample_batch(N, param_grid, np.zeros((0, 2)), np.zeros((0, 2)))

plt.scatter(*zip(*new_params), s=0.1)
plt.show()
marcofavoritobi commented 2 years ago

Before: image

After: image

The error was that according to the definition, only the last phi of the vector must be used.

marcofavoritobi commented 2 years ago

@AldoGl I brought in all the relevant changes from #2, and fixed the bug of the R-sequence sampler. Is there something else missing to this PR?

BestBatchSampler could be improved as some operations (the shock of the parameters) are not vectorized. I would leave it as future work, since it is good enough as it is now.

If @muxator has some time, would be useful to have his opinion as well.

AldoGl commented 2 years ago

Thanks Marco, great and very timely work.

The only improvement I can think of is to remove the call digitize_data(sampled_points, search_space.param_grid) from the return statement of all sample_batch methods, and plug it as a discretisation step inside the sample method of the BaseSampler. For instance, in line 116 of the base.py module we might have a line with samples = digitize_data(samples, search_space.param_grid). This would simplify the creation of a new sampler as the user could rely on the base class for discretisation and would not need to think about it.

In case we agree, @muxator might gave a smart way of applying this change to all past commits, as an alternative we might apply this improvement on top of all commits (probably this is the best choice)

marcofavoritobi commented 2 years ago

Thanks Marco, great and very timely work.

The only improvement I can think of is to remove the call digitize_data(sampled_points, search_space.param_grid) from the return statement of all sample_batch methods, and plug it as a discretisation step inside the sample method of the BaseSampler. For instance, in line 116 of the base.py module we might have a line with samples = digitize_data(samples, search_space.param_grid). This would simplify the creation of a new sampler as the user could rely on the base class for discretisation and would not need to think about it.

In case we agree, @muxator might gave a smart way of applying this change to all past commits, as an alternative we might apply this improvement on top of all commits (probably this is the best choice)

Thank you for the suggestion. I have some thoughts on that. As a counterargument for removing digitized_data from sample_batch is that such method should return a batch of parameters of the desired search space; this is somewhat an implicit requirement due to the parameter search_grid of the abstract method sample_batch. A good addition is to add this requirement in the docstring (and in this case, adding a check in sample seems too much and not desirable for performance purposes). Another point is that, if some samplers already return data that are digitized, e.g. this is the case for the random uniform sampler, then the further digitization in 'sample' is redundant; we could first add a check whether the params returned by sample_batch are digitized or not, but it is still an overhead.

AldoGl commented 2 years ago

Ok I am convinced by what you say. I think we can merge the PR as it is is also @muxator agrees

muxator commented 2 years ago

LGTM.

The Ci failure is due to a transient error in npm. Subsequent runs of the failed job succeeded, I do not know why they are not showing here.

AldoGl commented 2 years ago

tests finally passing, merging now!