fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
924 stars 196 forks source link

Stats.cov throws if Stats.stdDev contains missing values #551

Open Choc13 opened 1 year ago

Choc13 commented 1 year ago

Calling Stats.cov on a frame for which Stats.stdDev returns <missing> values throws an exception due to mismatched matrix dimensions.

Here's an example that shows this:

#r "nuget: Deedle.Math"

open System
open Deedle
open Deedle.Math
open MathNet.Numerics.LinearAlgebra

let googReturns =
    series [ DateTime(2022, 10, 28) => 0.1
             DateTime(2022, 10, 29) => 0.11
             DateTime(2022, 10, 30) => 0.09 ]

let msftReturns =
    series [ DateTime(2022, 10, 29) => -0.01
             DateTime(2022, 10, 30) => 0.2 ]

let returns =
    Frame.ofColumns [ "GOOG" => googReturns
                      "MSFT" => msftReturns ]

// This one works fine because the stdDev is defined for all columns as they have 2+ values. It returns,
// DenseMatrix 2x2-Double
//      0.0001        0
//      0        0.02205
returns |> Stats.cov

let fbReturns =
    series [ DateTime(2022, 10, 30) => 0.01 ]

let returnsWithFb = returns |> Frame.addCol "FB" fbReturns

// This returns `series [ GOOG => 0.009999999999999796; MSFT => 0.148492424049175; FB => <missing>]`
returnsWithFb |> Stats.stdDev

// This subsequently throws
returnsWithFb |> Stats.cov

Which results in the following error:

System.ArgumentException: Matrix dimensions must agree: op1 is 2x2, op2 is 3x3.

I was expecting it to instead return something like

series [ GOOG => series [ GOOG => 9.999999999999591E-05; MSFT => 0; FB => <missing>]; MSFT => series [ GOOG => 0; MSFT => 0.022050000000000007; FB => <missing>]; FB => series [ GOOG => <missing>; MSFT => <missing>; FB => <missing>]]

I've only just started using this library, so perhaps this is the intended behaviour, but the docs lead me to believe that the philosophy of this library was to propagate missing values where possible.

I did inline and modify the Stats.cov function locally and got it to output the frame I was expecting with the <missing> values by changing it to:

let covMatrix (df: Frame<'R, 'C>) =
    // Treat nan as zero, the same as MATLAB
    let corr =
        df
        |> Stats.corrMatrix
        |> Matrix.map (fun x -> if Double.IsNaN x then 0. else x)

    let stdev =
        df
        |> Stats.stdDev
        |> Series.valuesAll // <-- This and the following line were the changes I made, I feel like there might be a neater / more efficient way to do this
        |> Seq.map (Option.defaultValue nan)
        |> Array.ofSeq

    let stdevDiag = DenseMatrix.ofDiagArray stdev
    stdevDiag * corr * stdevDiag

I'd be happy to open a PR for this change or similar if this <missing> value propagation behaviour is correct.

zyzhu commented 1 year ago

This is indeed a bug. Your fix is perfectly fine. Would you mind opening a PR? Thanks!

Choc13 commented 1 year ago

Not a problem. Will submit a PR this afternoon. Thanks for confirming.