[Feature Request] `Frequency` merge operations

bvenn commented 1 year ago

Merge operations for Maps

Data can be sorted into bins of predefined width using the Frequency or EmpiricalDistribution module. If two datasets are binned and should be merged afterwards, several merging strategies are possible. A simple merge of freqA and freqB is straightforward with keys that are present in freqA and freqB are replaced with the values of freqB.

let a =
    [("k1",1);("k2",3)]
    |> Map.ofList

let b =
    [("k2",1);("k3",4)]
    |> Map.ofList

merge a b

results in the following combination with ("k2",3) from a being replaced by ("k2",1) from b:

val it: Map<float,int> = map [("k1", 1); ("k2", 1); ("k3", 4)]

Generic formulation of merge operations

I'm in the process of adding a generic function that gets an additional function that handles key duplicates. E.g.:

add a b

resulting in the combination of a and b with ("k2",3) from a being added to ("k2",1) from b:

val it: Map<float,int> = map [("k1", 1); ("k2", 4); ("k3", 4)]

While this is trivial, I'm not sure how to handle a subtraction. Should the result from subtract a b result in:

a) val it: Map<float,int> = map [("k1", 1); ("k2", 2); ("k3", 4)]
- counts from a are subtracted by the corresponding values from b if keys are present in both maps
- here the values of a that are not present in b are untouched
b) val it: Map<float,int> = map [("k1", 1); ("k2", 2); ("k3", -4)]
- counts from a are subtracted by the values from b, even for keys that are not present in a

The latter option (b) makes no sense to me since frequency counts should not be negative, but I cannot think of applications in which the result of (a) makes any sense. Maybe the subtract function is not the best to start with because in this post they implemented (a) with addition and multiplication examples. Especially for the addition, a and b would give the correct result and I think it is intuitive to just apply the function to values of keys that are present in both maps.

@HarryMcCarney, do you know use cases that use subtract? Do you have any thoughts about this? I would suggest to add version (a) to Frequency as well as EmpiricalDistribution

Additional remark: When applied to continuous data bandwidths must be equal, to not merge counts from overlapping bins!

bvenn commented 1 year ago

A difficulty that I'm not sure how to tackle is the required bandwidth equality on continuous data. If you want to add the following histograms:

let a =
    [(0.1,1);(0.2,1);(0.3,1)] //bandwith = 0.1
    |> Map.ofList

let b =
    [(0.15,1);(0.3,1)] //bandwidth = 0.15 or 0.05, nobody knows..
    |> Map.ofList

merge a b  
// result: [(0.1,1);(0.15,1);(0.2,1);(0.3,2)] is not valid!!

Histograms (regardless if they are Frequencies or EmpiricalDistributions) that should be merged, have to have the same bandwidth. For categorical data this is no issue!

Solution

(A) introduce a Frequency/EmpiricalDistribution type that contains the frequency map as well as a bandwidth field that can be checked when merged
- downside: When dealing with categorical data this is totally useless.
(B) Leave it as it is and properly document this behaviour and trust the user not to merge histograms of differing bandwidth
- downside: Can you make the user responsible for this issue?
(C) Check the bandwidth when merged
- this is not possible in the current form because bin can be missing in the map structure and the bandwidth would be determined wrongly
(D) Add parameters to merge functions that requests the user to state the used bandwidth of both histograms. Nothing is done with these bandwidth except to check and fail if they do not match. This adds irrelevant ceremony to the function call, but ensures to not merge histograms of differing bandwidth.
- downside: For categorical data however this parameter would be hard to define

bvenn commented 1 year ago

For now I decided to go with an unsatisfactory hybrid of (B) and (C). I added a parameter that requests the user to specify if the maps are based on equal binning or if it is categorical data. If its continuous data with unequal binning, the merge fails with a description explaining the issue.

In future a procedure could be implemented that dissect both maps and creates a new one with a new binning. If my understanding is correct the bandwidth must be double the maximal bandwidth that is observed in the input maps.

fslaborg / FSharp.Stats