gchq / coreax

A library for coreset algorithms, written in Jax for fast execution and GPU support.
Apache License 2.0

Design classes/objects as product of conversion to object-orientated codebase #105

Closed: tp832944 closed 9 months ago

tp832944 commented 1 year ago

Blocked by #104

tp832944 commented 1 year ago

Will roll into sprint 4.

tp832944 commented 1 year ago

See https://github.com/gchq/coreax/issues/104#issuecomment-1707984419 for some useful thoughts.

tp832944 commented 1 year ago

@tl82649 and I are considering following the model of sklearn: one class for each coreset calculation method, with one or two levels of abstract base classes to enforce a consistent interface. Method parameters, but not data, are stored on the class.

I am a little concerned that we'll end up with classes consisting of only one method, calculate(), in which case why not stick with functional programming?

@tl82649 is aiming to sketch out a design before the next sprint planning meeting. Then we can evaluate:

tl82649 commented 1 year ago

@tp832944 I've attached a sketch to start thinking about an OOP version. Hopefully it's reasonably clear -- there's a key -- but it is a first pass with no real formality.

I believe you can open the attached .png in draw.io and it will pick up the embedded diagram for editing. Let me know if not and I'll attach the diagram file.

coreax_ood.drawio

gchqdev227 commented 1 year ago

@sk062380

tp832944 commented 1 year ago

@tl82649 I think we agreed to store only hyperparameters on the class, no data. We intended to follow the sklearn architecture. However, sklearn does appear to store the fitted parameters on the class, but not the input data. Following this approach here would mean storing the calculated coreset on the KernelHerding class etc., but not storing the original data there. Only the hyperparameters are set during init, with the coreset being set during a calculate_coreset method, which takes the original data as input. I'm not sure whether I misunderstood your vision and you are already suggesting this. What do you think of the approach as I've just described?

tl82649 commented 1 year ago

@tp832944 Yes, that sounds like a sensible approach, and the one I was originally intending here. Storing the coreset on the class should be okay in most cases; we may have to think about memory if the coreset size or the data dimensionality is large, but I suspect those would be edge cases. Storing the coreset on the class would also be useful for any follow-on refine calls.

I'd suggest storing the indices only, but they would point to a data object that lies outside the class. In that case we could be explicit about what the indices mean, but it still leaves the risk of confusion if data rows are permuted elsewhere for example. For refine (and other) calls, we could insist on having the indices supplied as a function argument to defend against this (having returned them previously during calculate_coreset).
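
Pulling the last two comments together, a minimal sketch of the pattern being agreed — hyperparameters at init, indices-only storage, and refine demanding the indices back. The names (calculate_coreset, coreset_indices_) are illustrative assumptions, not a final API:

```python
class KernelHerding:
    """Sketch: hyperparameters at init; no input data stored on the class."""

    def __init__(self, kernel, coreset_size):
        self.kernel = kernel
        self.coreset_size = coreset_size
        self.coreset_indices_ = None  # set by calculate_coreset

    def calculate_coreset(self, data):
        indices = ...  # greedy herding over the rows of `data`
        self.coreset_indices_ = indices  # indices only; `data` is not kept
        return indices

    def refine(self, data, coreset_indices):
        # Indices must be passed back explicitly, guarding against `data`
        # having been permuted elsewhere since calculate_coreset ran.
        ...
```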

tp832944 commented 1 year ago

@tl82649 @gchqdev227 @pc532627 A few queries:

  1. Do we want a SteinKernelHerding class in addition to standard KernelHerding? I haven't been through line-by-line to see whether it is anything more than KernelHerding with a Stein kernel. It'd be neater if we don't require a separate SteinKernelHerding class. If we do, I'd want a KernelHerding ABC (abstract base class) - in which case, what should we call standard KernelHerding?
  2. There are some unused functions for standard kernel herding body and Frank-Wolfe kernel herding body. Do we want to keep these?
  3. There are three functions in refine.py for different ways to refine a coreset. They all depend on the kernel, so could in theory be added as methods to the Kernel ABC. Alternatively, we could create a Refine ABC with three subclasses. We can always add a constructor to Kernel. Any preference between these?
  4. Will the implementation of refine methods always be independent of the coreset method?
  5. For functions like grad_x of a kernel, is the gram_matrix parameter only there to avoid recalculating if it's already been calculated in the parent function, or could it in theory be anything?
  6. Is there any need to OOP metrics? These feel like pure functions.
  7. @tl82649 proposed a DatasetReduction class. What existing functions should become part of this class?
gchqdev227 commented 1 year ago

Okay, I think @tl82649 is going to need to weigh in on most of these, but my initial thoughts:

  1. I think that will depend on how separable the two parts of Stein herding are, and whether we expect users to wish to expand in "both directions", if that makes sense. I think if we're careful then starting with your first option won't hold us back from making that change in the future. Incidentally, although ABCs do have their places in Python, I wouldn't necessarily reach for them yet unless we expect users to be implementing their own KernelHerding subclasses within a sufficiently complex stack that they need the early heads-up at implementation time and not runtime. I would start with a parent class which implements skeleton methods which raise NotImplementedError instead (see the sketch after this list).
  2. I would argue that we should remove them, and if we need to refer to them again we can pick them out of the history.
  3. Are you suggesting a refine method on each kernel class? Arguably one would put such a method on a Coreset class, but that doesn't fit with our current plan. I'm not sure what you mean re the Kernel constructor, although it's likely I've missed some context somewhere.
  4. Question for @tl82649!
  5. I'm guessing it's also there in case the user wants to create the matrix in a completely different way to the recommended approach, more than it being a cache, but I could be wrong. If we're moving towards a Kernel class then we could in theory handle caching within the instance, but I agree that we need to know if it still needs to be exposed to the user.
  6. Great question! Some of the functions have arguments which the user may not wish to pass around, such as the precision threshold. Technically a user could just make use of functools.partial to give themselves a "frozen" metric function if that mattered, but I tend to err on the side of classes here for two reasons:
    1. It allows us to enforce a generic metric "call" by placing non-shared arguments within the instance.
    2. I find it easier to "OOP" things as classes, and then if we deem them redundant later on once we've transformed the repo (i.e., only ever __init__ and another method) then we can always go back for simplicity. Think of it as a refactoring step rather than a final state!
  7. Question for @tl82649
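
Regarding point 1, a minimal sketch of the skeleton-method pattern (class and method names are illustrative, not a final API):

```python
class Coreset:
    """Parent with skeleton methods, as a lighter-weight stand-in for an
    abc.ABC."""

    def calculate_coreset(self, data):
        # Subclasses override this; failure happens at call time rather than
        # at instantiation time, as it would with an ABC.
        raise NotImplementedError
```
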
tp832944 commented 1 year ago

Thanks for your input @gchqdev227 .

  1. Sounds like we should try to avoid a SteinKernelHerding class. I don't agree with your reason for avoiding ABCs - the parent should never be instantiated, so you're basically writing something that functions as an ABC using NotImplementedErrors. Anyway, this is moot for KernelHerding.
  3. Putting the various refine methods on Coreset is a much better idea. There are currently three refine functions: standard, rand and rev. We could add these as three methods to Coreset, but I think one method that takes an instantiated Refine class as a parameter is neater and more flexible. The Refine class would still have a method that could be called directly and would in fact be called by Coreset.refine(). A minimal sketch follows this list.
  6. Your argument is reasonable, so let's go for OOP.
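
A minimal sketch of the strategy-object idea in point 3, assuming hypothetical Refine and Coreset interfaces (none of these names are final):

```python
class Refine:
    """Base refine strategy; one subclass per approach (standard, rand, rev)."""

    def refine(self, coreset):
        raise NotImplementedError


class RefineRandom(Refine):
    """Would wrap the current 'rand' refine function."""

    def refine(self, coreset):
        ...


class Coreset:
    def refine(self, refine_strategy):
        # Coreset stays agnostic of which refine approach is used; the
        # strategy object can still be called directly if preferred.
        return refine_strategy.refine(self)
```
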
tp832944 commented 1 year ago

@gchqdev227 @tl82649 Do we need separate RBFKernel and NormalisedRBFKernel?

tp832944 commented 1 year ago

@tl82649 @gchqdev227 @pc532627 @jd750687 Here is an updated version of @tl82649's diagram including my thoughts. In addition to my outstanding queries above, comments please on this OOP design. oop_design_20230920

tp832944 commented 1 year ago

@tl82649 One more query:

tp832944 commented 1 year ago

The purpose of a Data ABC is that some types of input data need more preprocessing than others, e.g. PCA due to their size. We need more design work to consider how this is going to work. Coreset really only wants an array as input data, so the Data class should be about preparing an array from whatever input it is designed to process.
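
A hedged sketch of the Data ABC under this view, where each subclass's only job is to produce an array (the as_array name is an assumption):

```python
from abc import ABC, abstractmethod

import jax.numpy as jnp


class Data(ABC):
    """Turns whatever input a subclass handles into an array for Coreset."""

    @abstractmethod
    def as_array(self) -> jnp.ndarray:
        """Return the prepared 2D array, applying any preprocessing (e.g. PCA)."""
```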

tl82649 commented 1 year ago

@tp832944 Some responses in turn.

  1. Do we want a SteinKernelHerding class in addition to standard KernelHerding? [...]

No. Stein herding is simply herding with the Stein kernel. I actually think we should -- at some point, possibly post-OOP -- switch the herding examples to use simpler kernels, since the Stein kernel is a complicated introduction given its dependence on the score function (see #150). Also, the Stein kernel is more appropriate for the putative SteinThinning class. Stein thinning uses a slightly different herding-esque method that utilises a key property of the Stein kernel, i.e. avoiding integrals w.r.t. $\mathbb{P}$.

  2. There are some unused functions for standard kernel herding body and Frank-Wolfe kernel herding body. Do we want to keep these?

Yes, remove. These are hallmarks of a Hilbert space method that we haven't advanced yet. We can always bring them back in for later methods.

  3. There are three functions in refine.py for different ways to refine a coreset. [...] Alternatively, we could create a Refine ABC with three subclasses. [...]

Refine ABC. The kernel simply defines the metric for choosing points.

  4. Will the implementation of refine methods always be independent of the coreset method?

Yes. Any use of refine steps as a component of a coreset method, e.g. in kernel thinning, would still make use of the refine method in the putative Coreset ABC object.

  5. For functions like grad_x of a kernel, is the gram_matrix parameter only there to avoid recalculating if it's already been calculated in the parent function, or could it in theory be anything?

Exactly, it's to avoid expensive recomputation if computation has already been done.

  6. Is there any need to OOP metrics? These feel like pure functions.

Another argument for OOP here might be the distinction between mathematical metrics, i.e. those that obey the properties of a metric, and quality metrics, e.g. KL-divergence (which is asymmetric). Could we envisage a method that utilises the properties of integral probability metrics?

  7. @tl82649 proposed a DatasetReduction class. What existing functions should become part of this class?

I think refine should actually sit here, since it's not specific to coresets, i.e. it could be used in distillation methods. Thinking more on this, the refine(A, B) method should take in two sets of points A and B, and return a set of points C that consists of some points of A and B that greedily improve a metric, e.g. MMD, over B alone.

Also possibly fit and save from Coreset ABC? The key discriminant of a coreset is the storage of indices over points, which should justify the continued use of the Coreset ABC with the aforementioned methods moved up to the DatasetReduction class.
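
One possible reading of that refine(A, B) contract, as a hedged sketch only — the greedy single-swap loop and the metric signature are assumptions, not the repo's implementation:

```python
import jax.numpy as jnp


def refine(a: jnp.ndarray, b: jnp.ndarray, metric) -> jnp.ndarray:
    """Greedily swap rows of B for rows of A while `metric` improves
    (lower is better, e.g. MMD against a reference set)."""
    c, best = b, metric(b)
    for candidate in a:          # consider each point of A in turn
        for i in range(c.shape[0]):
            trial = c.at[i].set(candidate)  # swap row i of C for the candidate
            score = metric(trial)
            if score < best:
                c, best = trial, score
    return c
```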

tl82649 commented 1 year ago

@gchqdev227 @tl82649 Do we need separate RBFKernel and NormalisedRBFKernel?

@tp832944 Not strictly. The former was intended as a standalone kernel for herding, etc. The latter is a convenience function intended for explicit score functions with KDEs, e.g. following sklearn's KernelDensity. They call it a Gaussian kernel, which is different to a Gaussian RBF.

Perhaps we should be explicit and use GaussianDensityKernel instead of NormalisedRBFKernel? Then we could use RBFKernel as an abstract class, since technically it refers to a group of radial basis functions. What we call RBFKernel can then be renamed to SquaredExponentialKernel(RBFKernel) (to be consistent with common terminology) or GaussianKernel(RBFKernel) (to be consistent with the Wikipedia definition), and the other kernels, e.g. PC-IMQ, can also inherit from RBFKernel.
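
For concreteness, the renaming proposal as a class hierarchy (a sketch; the compute signature is an assumption):

```python
from abc import ABC, abstractmethod


class Kernel(ABC):
    @abstractmethod
    def compute(self, x, y):
        """Evaluate k(x, y); signature assumed."""


class RBFKernel(Kernel, ABC):
    """Abstract: the family of radial basis function kernels."""


class SquaredExponentialKernel(RBFKernel):
    """Previously RBFKernel; renamed to match common terminology."""

    def compute(self, x, y):
        ...


class PCIMQKernel(RBFKernel):
    """Pre-conditioned inverse multiquadric; also radial."""

    def compute(self, x, y):
        ...


class GaussianDensityKernel(SquaredExponentialKernel):
    """Previously NormalisedRBFKernel; normalised for KDE score functions."""
```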

tl82649 commented 1 year ago

@tl82649 One more query:

  • Which existing function maps onto weights.MMD? I can't find anything suitable in docstrings.

@tp832944 Technically weights.simplex_weights, since any solution to that quadratic form minimises MMD w.r.t. the supplied kernel. The simplex constraint gives a reduced solution set. To minimise KSD for example, we need the non-negativity constraint and a different objective function, cf. #134.
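
For concreteness, the quadratic form in question is presumably the standard weighted-MMD objective (a reconstruction from the discussion, using the kbar and Kmm names that appear later in this thread):

$$\min_{w}\; w^\top K_{mm} w - 2 w^\top \bar{k} \quad \text{s.t. } \sum_i w_i = 1,\; w_i \ge 0,$$

since $\mathrm{MMD}^2$ between the $w$-weighted coreset and the full dataset is $w^\top K_{mm} w - 2 w^\top \bar{k}$ plus a constant in $w$, where $K_{mm}$ is the coreset Gram matrix and $\bar{k}$ holds the mean kernel evaluations of each coreset point against the full data.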

tl82649 commented 1 year ago

@tl82649 @gchqdev227 @pc532627 @jd750687 Here is an updated version of @tl82649's diagram including my thoughts. In addition to my outstanding queries above, comments please on this OOP design. oop_design_20230920

@tp832944 Looks mostly fine to me. Some comments:

tp832944 commented 1 year ago
  5. For functions like grad_x of a kernel, is the gram_matrix parameter only there to avoid recalculating if it's already been calculated in the parent function, or could it in theory be anything?

Exactly, it's to avoid expensive recomputation if computation has already been done.

@tl82649 This sounds like an opportunity for automated caching to save the user having to precompute and pass around an extra variable. The snag is it depends on x and y.

We could implement this with an LRU cache, or assume x and y are immutable (sounds dangerous), or add x and y to the Kernel class. If x and y are on the Kernel class, either a new instance is required for each x and y (perhaps call it KernelInstance) or we use property setters to invalidate any cached Gram matrix whenever x or y is changed. If we're only likely to need a cache size of one, my final option seems very tempting.

However, this sounds like it wouldn't work well with JAX, which expects pure functions and repeated calls of the same function in the case of vectorisation. Therefore, I suspect we're stuck with letting the user pass in the Gram matrix each time for a performance boost, with no nice way to do that automatically.

Do you agree with my conclusion? Do you have any alternative ideas?
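
To make the "cache size of one" option concrete, a hedged sketch (the property names and the _compute_gram helper are hypothetical):

```python
class Kernel:
    """Sketch: x and y live on the instance; property setters invalidate a
    single cached Gram matrix."""

    def __init__(self, x, y):
        self._x = x
        self._y = y
        self._gram = None

    @property
    def x(self):
        return self._x

    @x.setter
    def x(self, value):
        self._x = value
        self._gram = None  # any cached Gram matrix is now stale

    @property
    def y(self):
        return self._y

    @y.setter
    def y(self, value):
        self._y = value
        self._gram = None

    @property
    def gram_matrix(self):
        if self._gram is None:
            self._gram = self._compute_gram(self._x, self._y)
        return self._gram

    def _compute_gram(self, x, y):
        ...  # pairwise kernel evaluations; implementation-specific
```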

tp832944 commented 1 year ago

@tl82649 @gchqdev227 @pc532627 @jd750687 Here is an updated version of @tl82649's diagram including my thoughts. In addition to my outstanding queries above, comments please on this OOP design. oop_design_20230920

@tp832944 Looks mostly fine to me. Some comments:

  • I suggest moving fit, refine, save, original_data and possibly render from Coreset ABC to DatasetReduction following my comments here
  • I don't think fit should have a kernel argument. Not all coreset (or dataset reduction) methods will use kernels. Perhaps kernel should be passed as an initialisation argument for (and set as an attribute of) kernel methods, e.g. KernelHerding? Or an intermediary parent class, e.g. KernelCoreset(Coreset ABC)?

@tl82649 Is what I intended by Coreset.fit() the same as DatasetReduction.size_reduce()?

I'm happy to move fit() etc. to DatasetReduction.
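
A minimal sketch of the intermediary-parent suggestion (all names provisional):

```python
class Coreset:
    """Base: fit takes data only, so non-kernel methods share the signature."""

    def fit(self, data):
        raise NotImplementedError


class KernelCoreset(Coreset):
    """Intermediary parent: the kernel is fixed at initialisation, keeping
    the kernel out of the fit() signature."""

    def __init__(self, kernel):
        self.kernel = kernel


class KernelHerding(KernelCoreset):
    def fit(self, data):
        ...  # herding driven by self.kernel
```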

tp832944 commented 1 year ago

@tl82649 @gchqdev227 @pc532627 @jd750687 I believe I have now addressed all points, except that I have run out of time to produce a full class diagram. Thank you for your input. This design is definitely not final but is close enough to start implementing. Happy to take further comments.

oop_design_20230922

How to call david.py

orig_data = data_handler.load("david.png")
coreset = orig_data.reduce("map_reduce", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("random", 8000)
print(f"Coreset MMD = {coreset.compute_metric("MMD")}")
print(f"Random MMD = {random_points.compute_metric("MMD")}")
orig_data.render()
coreset.render()
random_points.render()
tp832944 commented 1 year ago

Reopening for some design changes.

tp832944 commented 1 year ago

Updated design covering various factories, especially in reduction.py.

oop_design_20231004

How to call david.py

orig_data = load_file("david.png")
coreset = orig_data.reduce("map reduce", "kernel herding", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("size reduce", "random", 8000)
print(f"Coreset MMD = {coreset.compute_metric("MMD")}")
print(f"Random MMD = {random_points.compute_metric("MMD")}")
orig_data.render()
coreset.render()
random_points.render()
tp832944 commented 1 year ago

@bk958178 @sk062380 Some good structures to follow from #183:

tp832944 commented 1 year ago

Updated OOP design reintroducing ClassFactory, now in util.py. Sometimes you just have to try before you realise there's a better way. Sorry for the U-turn.

oop_design_20231017
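
For reference, a hedged sketch of what such a ClassFactory in util.py might look like (method names and registry contents are assumptions, not the repo's actual API):

```python
class ClassFactory:
    """Maps string names to classes so callers can request, e.g.,
    "kernel_herding" without importing the concrete class."""

    def __init__(self):
        self._registry = {}

    def register(self, name, cls):
        self._registry[name] = cls

    def create(self, name, *args, **kwargs):
        try:
            cls = self._registry[name]
        except KeyError:
            raise KeyError(f"Unknown class name: {name!r}") from None
        return cls(*args, **kwargs)
```

reduction.py could then hold a factory instance populated with something like factory.register("kernel_herding", KernelHerding), keeping the string-to-class mapping in one place.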

tp832944 commented 1 year ago

@tl82649 I think you specified weighted as an attribute on what has become the DataReduction ABC. Based on examples/weighted_herding.py, this should be boolean, but when True it does not call either function in weights.py. Should this instead specify one of the options in weights.py, with the option of None to not calculate weights? In which case, I think the weights method in the example needs adding to weights.py - or is it the same as simplex weights?

tl82649 commented 1 year ago

@tp832944

@tl82649 I think you specified weighted as an attribute on what has become the DataReduction ABC. Based on examples/weighted_herding.py, this should be boolean, but when True it does not call either function in weights.py. Should this instead specify one of the options in weights.py, with the option of None to not calculate weights? In which case, I think the weights method in the example needs adding to weights.py - or is it the same as simplex weights?

Yes, this should specify a method, since I envisage a few optimisation methods being added in future, in addition to the two currently implemented. None is sensible for no weights.

The weighted_herding.py example calls coreax.util.solve_qp directly since it already has the requisite data structures to hand (simplex_weights computes these as part of the routine).

I suggest two additional arguments to simplex_weights, namely its internal kbar and Kmm, which can optionally be supplied if already computed. weighted_herding.py should then call simplex_weights accordingly.
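
A sketch of that suggestion; the kernel.compute interface and the exact solve_qp call are assumptions layered on the existing coreax.util.solve_qp mentioned above:

```python
from coreax.util import solve_qp  # existing QP solver referenced above


def simplex_weights(x, coreset, kernel, kbar=None, Kmm=None):
    """Solve for simplex-constrained weights, optionally reusing precomputed
    kbar (mean kernel evaluations) and Kmm (coreset Gram matrix)."""
    if Kmm is None:
        Kmm = kernel.compute(coreset, coreset)          # m x m Gram matrix
    if kbar is None:
        kbar = kernel.compute(coreset, x).mean(axis=1)  # mean kernel vector
    return solve_qp(Kmm, kbar)  # call signature assumed
```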

tp832944 commented 12 months ago

Thanks @tl82649 , I've added these changes to the next update to the OOP diagram.

tp832944 commented 12 months ago

Here is an update accounting for a few changes around weights, as above.

oop_design_20231030

tp832944 commented 11 months ago

@tl82649 @gchqdev227 @pc532627 @sk062380 @bk958178 @jd750687 Here is the latest design. oop_design_20231101

How to call david.py

orig_img: numpy.ndarray = cv2.imread("david.png")
orig_data = Image.load(orig_img)
coreset = orig_data.reduce("map_reduce", "kernel_herding", 8000, kernel=SteinKernel(PCIMQKernel()))
random_points = orig_data.reduce("map_reduce", "random", 8000)
print(f"Coreset MMD = {coreset.compute_metric("MMD")}")
print(f"Random MMD = {random_points.compute_metric("MMD")}")
orig_data.render()
coreset_plot = coreset.render()
coreset_plot.savefig("david_coreset.png")
random_points.render()
tp832944 commented 11 months ago

@sk062380 Diagram updated to move reduction_indices from DataReduction to Coreset because the reduced data need not be a subset of the original for distillation.

@jd750687 I've made some minor edits to the wording of DataReader but nothing for you to change.

[updated design diagram attached]

tp832944 commented 11 months ago

@jd750687 Added DataReader.restore_dimension, improved specification of DataReader.reduce_dimension and added _dimension_reduction_meta.

RefineRev renamed to RefineReverse.

oop_design_20231121

tp832944 commented 11 months ago

The latest version from working through #167. Changes include a construct_kernel function and more parameters to map_reduce.

oop_design_20231122

tp832944 commented 11 months ago

Thought for the future. This is all configured for reducing two-dimensional matrices. In theory, some of these methods would work in higher dimensions, although they would probably require new versions. Maybe even a fork of this repo.

tp832944 commented 11 months ago

Some more updates:

oop_design_20231129

tp832944 commented 10 months ago

@tl82649 @gchqdev227 @pc532627 @sk062380 @bk958178 @jd750687 I've made a few modifications to get MapReduce working. What do you think of the following?

How to call david.py

orig_img: numpy.ndarray = cv2.imread("david.png")
orig_data = Image.load(orig_img)
kernel_herding = KernelHerding(kernel=SteinKernel(PCIMQKernel()))
kernel_herding.fit_with_strategy(orig_data, MapReduce(8000))
random_points = RandomSample()
random_points.fit_with_strategy(orig_data, SizeReduce(8000))
print(f"Coreset MMD = {coreset.compute_metric("MMD")}")
print(f"Random MMD = {random_points.compute_metric("MMD")}")
orig_data.render()
coreset_plot = kernel_herding.render()
coreset_plot.savefig("david_coreset.png")
random_points.render()

oop_design_20231218

tp832944 commented 10 months ago

oop_design_20231222

tp832944 commented 9 months ago

[final design diagram attached]