Closed tp832944 closed 9 months ago
Will roll into sprint 4.
See https://github.com/gchq/coreax/issues/104#issuecomment-1707984419 for some useful thoughts.
@tl82649 and I are considering following the model of sklearn: one class for each coreset calculation method, with one or two levels of abstract base classes to enforce a consistent interface. Method parameters, but not data, are stored on the class.
I am a little concerned that we'll end up with classes consisting of only one method, calculate(), in which case why not stick with functional programming?
@tl82649 is aiming to sketch out a design before the next sprint planning meeting. Then we can evaluate:
@tp832944 I've attached a sketch to start thinking about an OOP version. Hopefully it's reasonably clear -- there's a key -- but it is a first pass with no real formality.
I believe you can open the attached .png in draw.io and it will pick up the embedded diagram for editing. Let me know if not and I'll attach the diagram file.
@sk062380
@tl82649 I think we agreed to only store hyperparameters on the class, no data. We intended to follow the sklearn architecture. However, sklearn does appear to store the fitted parameters on the class, but not the input data. To follow this approach here would mean storing the calculated coreset on KernelHerding class etc., but not storing the original data there. Only the hyperparameters are set during init with the coreset being set during a calculate_coreset method, which takes the original data as input. I'm not sure whether I misunderstood your vision and you are already suggesting this. What do you think of the approach as I've just described?
@tp832944 yes that sounds like a sensible approach, and the one I was originally intending here. Storing the coreset on the class should be okay in most cases, though we may have to think about memory if the coreset size is still large or the dimensionality of the data is large, but I suspect those would be edge cases. Storing the coreset on the class would also be useful for any follow-on `refine` calls.

I'd suggest storing the indices only, but they would point to a data object that lies outside the class. In that case we could be explicit about what the indices mean, but it still leaves the risk of confusion if data rows are permuted elsewhere, for example. For `refine` (and other) calls, we could insist on having the indices supplied as a function argument to defend against this (having returned them previously during `calculate_coreset`).
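A minimal sketch of the pattern being discussed, with every class and method name illustrative rather than final: hyperparameters are set in `__init__`, data is passed to methods, and indices into the external data object are returned and later supplied back to `refine`.

```python
import numpy as np


class KernelHerding:
    """Illustrative sketch: hyperparameters on the class, data passed to methods."""

    def __init__(self, coreset_size: int):
        # Only hyperparameters are stored at construction time.
        self.coreset_size = coreset_size
        self.coreset_indices = None

    def calculate_coreset(self, data: np.ndarray) -> np.ndarray:
        # Placeholder selection: a real implementation would run kernel herding.
        self.coreset_indices = np.arange(self.coreset_size)
        return self.coreset_indices

    def refine(self, data: np.ndarray, indices: np.ndarray) -> np.ndarray:
        # Indices are supplied explicitly to defend against external permutation.
        return indices


data = np.random.default_rng(0).normal(size=(100, 2))
herder = KernelHerding(coreset_size=10)
indices = herder.calculate_coreset(data)
refined = herder.refine(data, indices)
```

The data itself never lives on the class; only the hyperparameter and the selected indices do.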
@tl82649 @gchqdev227 @pc532627 A few queries:
Okay, I think @tl82649 is going to need to weigh in on most of these, but my initial thoughts:

- [...] `NotImplementedError` instead.
- [...] `refine` method on each kernel class? Arguably one would put such a method on a Coreset class, but that doesn't fit with our current plan. I'm not sure what you mean re the Kernel constructor, although it's likely I've missed some context somewhere.
- [...] `functools.partial` to give themselves a "frozen" metric function if that mattered, but I tend to err on the side of classes here for two reasons: [...]
- [...] `__init__` and another method) then we can always go back for simplicity. Think of it as a refactoring step rather than a final state!

Thanks for your input @gchqdev227.
@gchqdev227 @tl82649 Do we need separate RBFKernel and NormalisedRBFKernel?
@tl82649 @gchqdev227 @pc532627 @jd750687 Here is an updated version of @tl82649's diagram including my thoughts. In addition to my outstanding queries above, comments please on this OOP design.
@tl82649 One more query:
The purpose of a Data ABC is that some types of input data need more preprocessing, e.g. PCA, than others due to their size. We need more design work to consider how this is going to work. Coreset really only wants to take an array as input data, so the Data class should really be about preparing an array from whatever input it is designed to process.
@tp832944 Some responses in turn.
- Do we want a SteinKernelHerding class in addition to standard KernelHerding? [...]
No. Stein herding is simply herding with the Stein kernel. I actually think we should -- at some point, possibly post-OOP -- switch the herding examples to use simpler kernels, since the Stein kernel is a complicated introduction given its dependence on the score function (see #150). Also, the Stein kernel is more appropriate for the putative `SteinThinning` class. Stein thinning uses a slightly different herding-esque method that utilises a key property of the Stein kernel, i.e. avoiding integrals w.r.t. $\mathbb{P}$.
- There are some unused functions for standard kernel herding body and Frank-Wolfe kernel herding body. Do we want to keep these?
No, remove these. They are hallmarks of a Hilbert space method that we haven't advanced yet. We can always bring them back for later methods.
- There are three functions in refine.py for different ways to refine a coreset. [...] Alternatively, we could create a Refine ABC with three subclasses. [...]
Refine ABC. The kernel simply defines the metric for choosing points.
- Will the implementation of refine methods always be independent of the coreset method?
Yes. Any use of refine steps as a component of a coreset method, e.g. in kernel thinning, would still make use of the `refine` method in the putative `Coreset` ABC object.
- For functions like grad_x of a kernel, is the gram_matrix parameter only there to avoid recalculating if it's already been calculated in the parent function, or could it in theory be anything?
Exactly, it's to avoid expensive recomputation if computation has already been done.
- Is there any need to OOP metrics? These feel like pure functions.
Another argument for OOP here might be the distinction between mathematical metrics, i.e. those that obey the properties of a metric, and quality metrics, e.g. KL-divergence (which is asymmetric). Could we envisage a method that utilises the properties of integral probability metrics?
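One hedged sketch of what OOP metrics might look like, with the mathematical-metric versus quality-metric distinction encoded as a class attribute. All names are illustrative, and the concrete metric below is a toy mean-difference stand-in, not a real MMD implementation:

```python
from abc import ABC, abstractmethod

import numpy as np


class Metric(ABC):
    """Illustrative base class for comparison metrics."""

    # True mathematical metrics are symmetric and obey the triangle
    # inequality; quality metrics such as KL divergence need not be.
    is_symmetric: bool = True

    @abstractmethod
    def compute(self, x: np.ndarray, y: np.ndarray) -> float:
        ...


class MeanDifference(Metric):
    """Toy stand-in for MMD: distance between sample means."""

    def compute(self, x: np.ndarray, y: np.ndarray) -> float:
        return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))


x = np.zeros((5, 2))
y = np.ones((5, 2))
d = MeanDifference().compute(x, y)
```

A method exploiting integral probability metric properties could then dispatch on `is_symmetric` (or a richer flag) rather than inspecting the function itself.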
- @tl82649 proposed a DatasetReduction class. What existing functions should become part of this class?
I think `refine` should actually sit here, since it's not specific to coresets, i.e. it could be used in distillation methods. Thinking more on this, the `refine(A, B)` method should take in two sets of points `A` and `B`, and return a set of points `C` that consist of some points of `A` and `B` that greedily improve a metric, e.g. MMD, over `B` alone.

Also possibly `fit` and `save` from `Coreset` ABC? The key discriminant of a coreset is the storage of indices over points, which should justify the continued use of the `Coreset` ABC with the aforementioned methods moved up to the `DatasetReduction` class.
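The `refine(A, B)` behaviour described above could be sketched greedily as follows. This is illustrative only: the metric is a toy mean-difference proxy, not MMD, and the swap strategy is the simplest possible one.

```python
import numpy as np


def metric(candidate: np.ndarray, target: np.ndarray) -> float:
    # Toy stand-in for MMD: distance between sample means.
    return float(np.linalg.norm(candidate.mean(axis=0) - target.mean(axis=0)))


def refine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Greedily build C from points of a and b so metric(C, a) beats metric(b, a)."""
    c = b.copy()
    for i in range(len(c)):
        for point in a:
            trial = c.copy()
            trial[i] = point  # propose swapping one point of C for a point of a
            if metric(trial, a) < metric(c, a):
                c = trial  # keep the swap only if the metric improves
    return c


a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([[5.0, 5.0], [0.0, 0.0], [1.0, 1.0]])
c = refine(a, b)
```

The returned `C` is the same size as `B` but mixes in points of `A` wherever that improves the metric, matching the contract described above.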
> @gchqdev227 @tl82649 Do we need separate RBFKernel and NormalisedRBFKernel?
@tp832944 Not strictly. The former was intended as a standalone kernel for herding, etc. The latter is a convenience function intended for explicit score functions with KDEs, e.g. following sklearn's KernelDensity. They call it a Gaussian kernel, which is different to a Gaussian RBF.
Perhaps we should be explicit and use `GaussianDensityKernel` instead of `NormalisedRBFKernel`? Then we could use `RBFKernel` as an abstract class, since technically it refers to a group of radial basis functions. What we call `RBFKernel` can then be renamed to `SquaredExponentialKernel(RBFKernel)` (to be consistent with common terminology) or `GaussianKernel(RBFKernel)` (to be consistent with the Wikipedia definition), and the other kernels, e.g. PC-IMQ, can also inherit from `RBFKernel`.
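A hedged sketch of the hierarchy proposed above. The kernel formulas and default lengthscales are stand-ins and would need checking against the codebase:

```python
from abc import ABC, abstractmethod

import numpy as np


class RBFKernel(ABC):
    """Abstract radial basis function kernel: k(x, y) = f(||x - y||^2)."""

    def __init__(self, lengthscale: float = 1.0):
        self.lengthscale = lengthscale

    def __call__(self, x: np.ndarray, y: np.ndarray) -> float:
        return self._radial(float(np.sum((x - y) ** 2)))

    @abstractmethod
    def _radial(self, squared_distance: float) -> float:
        ...


class SquaredExponentialKernel(RBFKernel):
    def _radial(self, squared_distance: float) -> float:
        return float(np.exp(-squared_distance / (2 * self.lengthscale**2)))


class PCIMQKernel(RBFKernel):
    """Pre-conditioned inverse multiquadric (illustrative formula)."""

    def _radial(self, squared_distance: float) -> float:
        return float(1.0 / np.sqrt(1.0 + squared_distance / (2 * self.lengthscale**2)))
```

Each concrete kernel only supplies its radial profile; the abstract parent owns the distance computation, so PC-IMQ and friends slot in alongside the squared exponential.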
@tl82649 One more query:
- Which existing function maps onto weights.MMD? I can't find anything suitable in docstrings.
@tp832944 Technically `weights.simplex_weights`, since any solution to that quadratic form minimises MMD w.r.t. the supplied kernel. The simplex constraint gives a reduced solution set. To minimise KSD, for example, we need the non-negativity constraint and a different objective function, cf. #134.
@tl82649 @gchqdev227 @pc532627 @jd750687 Here is an updated version of @tl82649's diagram including my thoughts. In addition to my outstanding queries above, comments please on this OOP design.
@tp832944 Looks mostly fine to me. Some comments:

- I suggest moving `fit`, `refine`, `save`, `original_data` and possibly `render` from `Coreset` ABC to `DatasetReduction`, following my comments here.
- I don't think `fit` should have a `kernel` argument. Not all coreset (or dataset reduction) methods will use kernels. Perhaps `kernel` should be passed as an initialisation argument for (and set as an attribute of) kernel methods, e.g. `KernelHerding`? Or an intermediary parent class, e.g. `KernelCoreset(Coreset ABC)`?
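The intermediary-class suggestion could look something like this. All names and the trivial `fit` bodies are purely illustrative:

```python
from abc import ABC, abstractmethod


class Coreset(ABC):
    """Base class: fit takes only data, with no kernel argument."""

    @abstractmethod
    def fit(self, data):
        ...


class KernelCoreset(Coreset, ABC):
    """Intermediary parent for methods that require a kernel."""

    def __init__(self, kernel):
        self.kernel = kernel  # kernel is an init argument, not a fit argument


class KernelHerding(KernelCoreset):
    def fit(self, data):
        # Placeholder: a real implementation would herd using self.kernel.
        self.coreset = data[:1]
        return self


class RandomSample(Coreset):
    """Kernel-free method: nothing kernel-related in its interface."""

    def fit(self, data):
        self.coreset = data[:1]
        return self
```

This keeps the `fit` signature uniform across kernel and kernel-free methods while still making the kernel a stored attribute where it is needed.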
> - For functions like grad_x of a kernel, is the gram_matrix parameter only there to avoid recalculating if it's already been calculated in the parent function, or could it in theory be anything?
>
> Exactly, it's to avoid expensive recomputation if computation has already been done.
@tl82649 This sounds like an opportunity for automated caching to save the user having to precompute and pass around an extra variable. The snag is it depends on x and y.
We could implement with an LRU cache, or assume x and y are immutable (sounds dangerous), or add x and y to the Kernel class. If x and y are on the Kernel class, either a new instance is required for each x and y (perhaps call it KernelInstance) or use property setters to invalidate any cached Gram matrix whenever x or y is changed. If we're only likely to need a cache size of one, my final option seems very tempting.
However, this sounds like it wouldn't work well with Jax, which expects pure functions and repeated calls of the same function in the case of vectorisation. Therefore, I feel we may be stuck with offering the user to pass in the Gram matrix each time for a performance boost but with no nice way to do that automatically.
Do you agree with my conclusion? Do you have any alternative ideas?
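The property-setter option above could be sketched as below: a cache of size one, invalidated whenever `x` or `y` changes. This is illustrative only (the class name `KernelInstance` comes from the comment above, the kernel is a stand-in squared exponential, and as noted this stateful pattern may clash with JAX's pure-function expectations):

```python
import numpy as np


class KernelInstance:
    """Kernel bound to data, with a size-one cache for the Gram matrix."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        self._x, self._y = x, y
        self._gram = None

    @property
    def x(self) -> np.ndarray:
        return self._x

    @x.setter
    def x(self, value: np.ndarray) -> None:
        self._x = value
        self._gram = None  # invalidate cache on data change

    @property
    def y(self) -> np.ndarray:
        return self._y

    @y.setter
    def y(self, value: np.ndarray) -> None:
        self._y = value
        self._gram = None

    def gram_matrix(self) -> np.ndarray:
        if self._gram is None:
            # Squared-exponential kernel as a stand-in.
            sq = ((self._x[:, None, :] - self._y[None, :, :]) ** 2).sum(-1)
            self._gram = np.exp(-sq / 2)
        return self._gram
```

Repeated `gram_matrix()` calls return the cached array until either setter fires, which is exactly the "cache size of one" behaviour described.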
@tl82649 Is what I intended by Coreset.fit() the same as DatasetReduction.size_reduce()?
I'm happy to move fit() etc. to DatasetReduction.
@tl82649 @gchqdev227 @pc532627 @jd750687 I believe I have now addressed all points, except that I have run out of time to produce a full class diagram. Thank you for your input. This design is definitely not final but is close enough to start implementing. Happy to take further comments.
```python
orig_data = data_handler.load("david.png")
coreset = orig_data.reduce("map_reduce", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("random", 8000)
print(f"Coreset MMD = {coreset.compute_metric('MMD')}")
print(f"Random MMD = {random_points.compute_metric('MMD')}")
orig_data.render()
coreset.render()
random_points.render()
```
Reopening for some design changes.
Updated design covering various factories, especially in reduction.py.
```python
orig_data = load_file("david.png")
coreset = orig_data.reduce("map reduce", "kernel herding", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("size reduce", "random", 8000)
print(f"Coreset MMD = {coreset.compute_metric('MMD')}")
print(f"Random MMD = {random_points.compute_metric('MMD')}")
orig_data.render()
coreset.render()
random_points.render()
```
@bk958178 @sk062380 Some good structures to follow from #183:
Updated OOP design reintroducing ClassFactory, now in util.py. Sometimes you just have to try before you realise there's a better way. Sorry for the U-turn.
@tl82649 I think you specified `weighted` as an attribute on what has become the DataReduction ABC. Based on `examples/weighted_herding.py`, this should be boolean, but when `True` does not call either function in weights.py. Should this instead be specifying one of the options in weights.py with the option of `None` to not calculate weights? In which case, I think the weights method in the example needs adding to weights.py - or is it the same as simplex weights?
@tp832944

> I think you specified `weighted` as an attribute on what has become the DataReduction ABC. Based on `examples/weighted_herding.py`, this should be boolean, but when `True` does not call either function in weights.py. Should this instead be specifying one of the options in weights.py with the option of `None` to not calculate weights? In which case, I think the weights method in the example needs adding to weights.py - or is it the same as simplex weights?
Yes, this should specify a method, since I envisage a few more optimisation methods being added in future, in addition to the two currently implemented. `None` is sensible for no weights.

The `weighted_herding.py` example calls `coreax.util.solve_qp` directly since it already has the requisite data structures to hand (`simplex_weights` computes these as part of the routine).

I suggest two additional arguments to `simplex_weights`, namely its `kbar` and `Kmm`, which can be optionally supplied if already computed. `weighted_herding.py` should then call `simplex_weights` accordingly.
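The optional-argument pattern being suggested might look like the following. This is only a hypothetical shape: the real `simplex_weights` lives in coreax's weights.py with its own signature, and the QP solve here is a placeholder returning uniform weights.

```python
import numpy as np


def simplex_weights(x, y, kernel, kbar=None, Kmm=None):
    """Sketch: compute kbar and Kmm only when not supplied by the caller."""
    if Kmm is None:
        # Gram matrix of the candidate points with themselves.
        Kmm = np.array([[kernel(a, b) for b in y] for a in y])
    if kbar is None:
        # Mean kernel evaluation of each candidate point against the full data.
        kbar = np.array([[kernel(a, b) for b in x] for a in y]).mean(axis=1)
    # Placeholder: real code would solve the quadratic program with Kmm and kbar.
    return np.full(len(y), 1.0 / len(y))
```

A caller such as `weighted_herding.py` that has already built `Kmm` and `kbar` for `solve_qp` could then pass them straight in, skipping the kernel evaluations entirely.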
Thanks @tl82649, I've added these changes to the next update to the OOP diagram.
Here is an update accounting for a few changes around weights, as above.
@tl82649 @gchqdev227 @pc532627 @sk062380 @bk958178 @jd750687 Here is the latest design.
```python
orig_img: numpy.ndarray = cv2.imread("david.png")
orig_data = Image.load(orig_img)
coreset = orig_data.reduce("map_reduce", "kernel_herding", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("map_reduce", "random", 8000)
print(f"Coreset MMD = {coreset.compute_metric('MMD')}")
print(f"Random MMD = {random_points.compute_metric('MMD')}")
orig_data.render()
coreset_plot = coreset.render()
coreset_plot.savefig("david_coreset.png")
random_points.render()
```
@sk062380 Diagram updated to move `reduction_indices` from `DataReduction` to `Coreset`, because the reduced data need not be a subset of the original for distillation.
@jd750687 I've made some minor edits to the wording of `DataReader`, but nothing for you to change.
@jd750687 Added `DataReader.restore_dimension`, improved the specification of `DataReader.reduce_dimension` and added `_dimension_reduction_meta`.
`RefineRev` renamed to `RefineReverse`.
The latest version from working through #167. Changes include a `construct_kernel` function and more parameters to `map_reduce`.
Thought for the future. This is all configured for reducing two-dimensional matrices. In theory, some of these methods would work in higher dimensions, although would probably require new versions. Maybe even a fork of this repo.
Some more updates:

- `DataReduction` renamed to `Coreset`
- `Coreset` renamed to `CoreSubset`
- `KernelMeanApproximator` moved to `approximation` module

@tl82649 @gchqdev227 @pc532627 @sk062380 @bk958178 @jd750687 I've made a few modifications to get `MapReduce` working. What do you think of the following?

- `fit` and `fit_to_size`. User should normally call `fit`.
- `clone_empty`, `copy_fit` and `_validate_fitted` added to `Coreset`.
- `coreset_indices` promoted to `Coreset`, although it may be `None` if not a coresubset method.

```python
orig_img: numpy.ndarray = cv2.imread("david.png")
orig_data = Image.load(orig_img)
kernel_herding = KernelHerding(kernel=SteinKernel(PCIMQKernel()))
kernel_herding.fit_with_strategy(orig_data, MapReduce(8000))
random_points = RandomSample()
random_points.fit_with_strategy(orig_data, SizeReduce(8000))
print(f"Coreset MMD = {kernel_herding.compute_metric('MMD')}")
print(f"Random MMD = {random_points.compute_metric('MMD')}")
orig_data.render()
coreset_plot = kernel_herding.render()
coreset_plot.savefig("david_coreset.png")
random_points.render()
```
Blocked by #104