gchq / coreax

A library for coreset algorithms, written in Jax for fast execution and GPU support.
Apache License 2.0
15 stars 1 forks source link

docs: simplify `Coreax.Coreset` docstring maths #684

Open tc85324 opened 3 days ago

tc85324 commented 3 days ago

PR Type

Description

Improves the presentation and clarity of the mathematics within the Coreax.Coreset docstring. This should make it much easier for users and contributors to understand our definitions and/or assumptions about "coresets".

If you have any ideas on how to make the documentation even simpler, or more mathematically precision, I would be keen to discuss them here.

How Has This Been Tested?

Documentation builds.

Does this PR introduce a breaking change?

N/A

Checklist before requesting a review

tc85324 commented 3 days ago

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

db091756 commented 4 hours ago

Given the definitions here, should we be checking that the weights are non-negative in some kind of __check_init__? More generally, should this check be on the Data classes. I.E. should we only allow handling of non-negative weights in Data? If we don't expect any of the current or future solvers/metrics to handle negative weights, then I don't think we would be losing anything by adding this constraint to Data.

3.1.1 from https://arxiv.org/pdf/1204.1664 gives good reasoning for why non-negative and normalised weights should probably not be enforced in Coreax.

db091756 commented 4 hours ago

The definition given in the docstring is very good, but seems quite linked to the way in which the recombination discusses coresets. e.g. not all coreset algorithms will satisfy the preservation of the CoM.

I think trying to get an extremely mathematically precise definition of a coreset is going to be very hard as coresets are discussed and introduced in very different ways in the literature. The TLDR:

"a coreset is a reduced set of :math:\hat{n} (potentially weighted) data points that, in some sense, best represent the "important" properties of a larger set of :math:n > \hat{n} (potentially weighted) data points."

is a great (not super mathematical) summary of what a coreset is. When we start talking about measures things could get more confusing for users as there exist many coreset algorithms out there, including probably the most popular in Kernel Herding, which don't discuss measures at all but instead talk in terms of distributions.

I am not sure what the best action would be here, but I would be tempted to keep things as simple as possible, and transfer the measure discussion to the Tree recombination docstring.

Perhaps following the TLDR we could add a few (as brief as possible without being misleading) examples of what the "important" properties are for specific algorithms such as CoM for Tree Recombination or the MMD for Kernel Herding.