fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/
Other
205 stars 54 forks source link

[Feature Request] `Frequency` merge operations #263

Open bvenn opened 1 year ago

bvenn commented 1 year ago

Merge operations for Maps

Data can be sorted into bins of predefined width using the Frequency or EmpiricalDistribution module. If two datasets are binned and should be merged afterwards, several merging strategies are possible. A simple merge of freqA and freqB is straightforward with keys that are present in freqA and freqB are replaced with the values of freqB.

let a =
    [("k1",1);("k2",3)]
    |> Map.ofList

let b =
    [("k2",1);("k3",4)]
    |> Map.ofList

merge a b

results in the following combination with ("k2",3) from a being replaced by ("k2",1) from b:

val it: Map<float,int> = map [("k1", 1); ("k2", 1); ("k3", 4)]

Generic formulation of merge operations

I'm in the process of adding a generic function that gets an additional function that handles key duplicates. E.g.:

add a b

resulting in the combination of a and b with ("k2",3) from a being added to ("k2",1) from b:

val it: Map<float,int> = map [("k1", 1); ("k2", 4); ("k3", 4)]

While this is trivial, I'm not sure how to handle a subtraction. Should the result from subtract a b result in:

The latter option (b) makes no sense to me since frequency counts should not be negative, but I cannot think of applications in which the result of (a) makes any sense. Maybe the subtract function is not the best to start with because in this post they implemented (a) with addition and multiplication examples. Especially for the addition, a and b would give the correct result and I think it is intuitive to just apply the function to values of keys that are present in both maps.

@HarryMcCarney, do you know use cases that use subtract? Do you have any thoughts about this? I would suggest to add version (a) to Frequency as well as EmpiricalDistribution

Additional remark: When applied to continuous data bandwidths must be equal, to not merge counts from overlapping bins!

bvenn commented 1 year ago

A difficulty that I'm not sure how to tackle is the required bandwidth equality on continuous data. If you want to add the following histograms:

let a =
    [(0.1,1);(0.2,1);(0.3,1)] //bandwith = 0.1
    |> Map.ofList

let b =
    [(0.15,1);(0.3,1)] //bandwidth = 0.15 or 0.05, nobody knows..
    |> Map.ofList

merge a b  
// result: [(0.1,1);(0.15,1);(0.2,1);(0.3,2)] is not valid!!

Histograms (regardless if they are Frequencies or EmpiricalDistributions) that should be merged, have to have the same bandwidth. For categorical data this is no issue!

Solution

bvenn commented 1 year ago

For now I decided to go with an unsatisfactory hybrid of (B) and (C). I added a parameter that requests the user to specify if the maps are based on equal binning or if it is categorical data. If its continuous data with unequal binning, the merge fails with a description explaining the issue.

In future a procedure could be implemented that dissect both maps and creates a new one with a new binning. If my understanding is correct the bandwidth must be double the maximal bandwidth that is observed in the input maps.