fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

Should Empirical PMF functions be usable with generic keys? #245

Closed HarryMcCarney closed 1 year ago

HarryMcCarney commented 1 year ago

I can create this

#r "nuget: fsharp.stats"
open FSharp.Stats
open FSharp.Stats.Distributions

let letters =
    "mississippi".ToCharArray()
    |> Array.map string
    |> Array.toList
    |> Frequency.createGeneric
    |> Empirical.ofHistogram

But then I can't get the probability for a specific value, because all functions except ofHistogram take a float as the map key. I can work around this by querying the map directly with letters["i"], but then letters["z"] throws an error instead of returning zero.

I would prefer to use probabilityAt, but it expects Map<float,float>. Should this function be generic, or have I missed something?
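For illustration, the lookup behaviour asked for here (zero instead of an error for a missing key) can be sketched with a small helper around Map.tryFind. The name probabilityOrZero is hypothetical and not part of FSharp.Stats, and the map below is built by hand as a stand-in for the result of Empirical.ofHistogram:

```fsharp
// Hypothetical helper (not part of FSharp.Stats): a lookup on a PMF with
// generic keys that returns 0.0 for keys that are not in the map.
let probabilityOrZero (key: 'k) (pmf: Map<'k, float>) : float =
    pmf |> Map.tryFind key |> Option.defaultValue 0.0

// Hand-built PMF for "mississippi" (assumption: stands in for the map
// returned by Empirical.ofHistogram on string keys).
let lettersPmf =
    Map.ofList [ "i", 4.0 / 11.0; "s", 4.0 / 11.0; "p", 2.0 / 11.0; "m", 1.0 / 11.0 ]

let pI = probabilityOrZero "i" lettersPmf // probability of a key that is present
let pZ = probabilityOrZero "z" lettersPmf // 0.0, because "z" is not a key
```

This only changes how the map is queried. It does not add the missing keys to the map itself, so charts built from the map would still omit them.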

thanks

bvenn commented 1 year ago

I will have a look at this. Maybe there is a performance advantage if you explicitly restrict it to float. If so, there should be additional "generic" functions. I'll test it and make the functions usable for "non-float" lists as well.

That you cannot look up letters that do not occur in your input is hard to work around within the module. There are a lot of possible alphabets that could be considered (upper case, lower case, äüö, special characters, numbers). I assume you have to add your desired set of characters separately by:

let myAlphabet = 
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ".ToCharArray()

With this alphabet at hand, you can use it as a template in which every character starts with a count of zero, and replace only the counts of characters that occur in your text.

#r "nuget: FSharp.Stats"
#r "nuget: Plotly.NET"

open FSharp.Stats
open FSharp.Stats.Distributions
open Plotly.NET

let myAlphabet = 
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ".ToCharArray()

let myTextMap = 
    "mississippi".ToCharArray()
    |> List.ofArray
    |> Frequency.createGeneric

let myFinalMap = 
    // use your own defined alphabet to include the desired set of characters
    myAlphabet
    |> Array.map (fun key -> 
        // if the text contains the current character, its value is used
        if myTextMap.ContainsKey key then 
            key,myTextMap.[key] 
        // if the text does NOT contain the current character, set its count to 0
        else 
            key,0
        )
    |> Map.ofArray

// access the character frequencies
myFinalMap.['z'] // 0
myFinalMap.['s'] // 4

// visualization
myFinalMap
|> Map.toArray
|> Chart.Column
|> Chart.withSize (1000.,500.) // quick way to depict all characters
|> Chart.show


I'll comment if I have any news.

bvenn commented 1 year ago

I fixed the issue, tested the Empirical.create function, and added a convenience layer for nominal/categorical inputs.

32fa0c23f2629dd9c149b4d98bc9c0befea86ad2

060f696a9e8f8bad7542bf35bb5ba885f560d574

7c1242dbe65710142e70e3c823bb46afeacafffd

still missing

Usage

You can build the binaries yourself or wait for the next FSharp.Stats release. (Update: You can use #r "nuget: FSharp.Stats, 0.4.12-preview.1")

Define the set of characters to search for:

#r @"<PathToFSharp.Stats>\FSharp.Stats\src\FSharp.Stats\bin\Release\netstandard2.0\FSharp.Stats.dll"
#r "nuget: Plotly.NET"

open FSharp.Stats
open FSharp.Stats.Distributions
open Plotly.NET

let letters = "Mississippi"

// Define your set of characters that should be checked for
// Any character that is not present in these sets is ignored
let myAlphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" |> Set.ofSeq
let mySmallAlphabet = "abcdefghijklmnopqrstuvwxyz" |> Set.ofSeq

These alphabets can be used to create the probability maps.

//takes the characters and determines their probabilities without considering non-existing characters
let myFrequencies0 = EmpiricalDistribution.createNominal() letters

//takes upper and lower case characters and determines their probability
let myFrequencies1 = EmpiricalDistribution.createNominal(Template=myAlphabet) letters

//takes only lower case characters and determines their probability
let myFrequencies2 = EmpiricalDistribution.createNominal(Template=mySmallAlphabet) letters

An additional parameter for transforming the input sequence may be useful if it does not matter whether a character is lower case or upper case:

//converts all characters to lower case characters and determines their probability
let myFrequencies3 = EmpiricalDistribution.createNominal(Template=mySmallAlphabet,Transform=System.Char.ToLower) letters

// check probability of non existing characters, that are within the search scope (Template alphabet)
myFrequencies3.['z'] //returns 0.0
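To make the roles of Template and Transform concrete, here is a minimal sketch in plain F# that does not use the library. The function nominalPmf is hypothetical and is not the FSharp.Stats implementation. It assumes that the transform is applied first, that values outside the template are dropped, and that probabilities are normalised over the retained values only; the library may differ on any of these points.

```fsharp
// Hypothetical sketch, NOT the FSharp.Stats implementation.
// Assumptions: transform first, then drop values outside the template,
// then normalise over the values that were kept.
let nominalPmf (template: Set<'a> option) (transform: 'a -> 'a) (data: seq<'a>) : Map<'a, float> =
    let kept =
        data
        |> Seq.map transform
        |> Seq.filter (fun x ->
            match template with
            | Some t -> Set.contains x t
            | None -> true)
        |> Seq.toList
    let total = float (List.length kept)
    // Every template key starts at probability 0.0.
    let zeros =
        match template with
        | Some t -> t |> Seq.map (fun k -> k, 0.0) |> Map.ofSeq
        | None -> Map.empty
    // Overwrite the zeros with the observed relative frequencies.
    kept
    |> List.countBy id
    |> List.fold (fun acc (k, n) -> Map.add k (float n / total) acc) zeros

let smallAlphabet = Set.ofSeq "abcdefghijklmnopqrstuvwxyz"
let sketchPmf = nominalPmf (Some smallAlphabet) System.Char.ToLowerInvariant "Mississippi"
let sketchZ = sketchPmf.['z'] // 0.0: in the template but not in the text
let sketchI = sketchPmf.['i'] // 4 of the 11 retained characters
```

Under these assumptions the template plays two roles: it filters the input, and it guarantees that every template key is present in the result, which is why indexing with 'z' does not fail.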

Visualization


[
    Chart.Column(myFrequencies0 |> Map.toArray,"noTemplate") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies1 |> Map.toArray,"bigAlphabet") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies2 |> Map.toArray,"smallAlphabet") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies3 |> Map.toArray,"toLower + smallAlphabet") |> Chart.withYAxisStyle "probability"
]
|> Chart.Grid(4,1)
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withTitle letters
|> Chart.withSize(1000.,900.)
|> Chart.show


bvenn commented 1 year ago

A prerelease is published and can be used:

#r "nuget: FSharp.Stats, 0.4.12-preview.1"

The documentation that contains the same information as this thread can be found here.

HarryMcCarney commented 1 year ago

Thanks Benedikt, nice solution!