fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

Ways to deal with nan in matrix operations #113

Open nhirschey opened 3 years ago

nhirschey commented 3 years ago

Is your feature request related to a problem? Please describe.

In R, when you calculate a variance-covariance matrix, you can specify how missing values are handled via the "use" argument: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/cor . For instance, use="pairwise.complete" computes each entry from the rows in which both columns of that pair are present.

It would be nice to have this option in FSharp.Stats and have it work for both float nan and float Option types. I'm not sure how best to mix new parameters into the Matrix operations API, though. Something along the lines of the Plotly.NET options might work: Matrix.columnSampleCovarianceMatrixOf Complete A, Matrix.columnSampleCovarianceMatrixOf PairwiseComplete A, etc.
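
A minimal sketch of what such an option could look like, written in plain F# on arrays so it does not depend on FSharp.Stats. The type MissingValueHandling and the helpers ofOptions and selectRows are hypothetical names for illustration, not existing FSharp.Stats API. The sketch only covers the parts that can be decided before any covariance is computed: mapping float option to nan, and dropping incomplete rows in the Complete case.

```fsharp
// Hypothetical option type mirroring R's "use" argument.
// Not an existing FSharp.Stats type.
type MissingValueHandling =
    | Everything        // keep all rows; nan propagates into the result
    | Complete          // drop every row that contains a missing value
    | PairwiseComplete  // decided per column pair inside the covariance routine

// Encode a missing float option as nan so both encodings share one code path.
let ofOptions (rows: float option[][]) : float[][] =
    rows |> Array.map (Array.map (Option.defaultValue nan))

// Row selection that can happen up front, before any covariance is computed.
let selectRows (mode: MissingValueHandling) (rows: float[][]) : float[][] =
    match mode with
    | Complete ->
        rows |> Array.filter (Array.forall (fun x -> not (System.Double.IsNaN x)))
    | Everything
    | PairwiseComplete -> rows

// Usage: the third row has a missing first entry and is dropped under Complete.
let optionRows =
    [| [| Some 1.0; Some 2.0 |]
       [| Some 1.1; Some 2.9 |]
       [| None;     Some 1.5 |] |]

let completeRows = optionRows |> ofOptions |> selectRows Complete
```

PairwiseComplete returns the rows unchanged here because the rows to drop differ for each pair of columns, so that case has to be handled inside the covariance computation itself.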

Describe the solution you'd like

For example, in R I can do

> A <- matrix(c(1,2,1.1,2.9,NA,1.5),ncol = 2, byrow = TRUE)
> A
     [,1] [,2]
[1,]  1.0  2.0
[2,]  1.1  2.9
[3,]   NA  1.5
> cov(A)
     [,1]      [,2]
[1,]   NA        NA
[2,]   NA 0.5033333
> cov(A,use="pairwise.complete")
      [,1]      [,2]
[1,] 0.005 0.0450000
[2,] 0.045 0.5033333

In FSharp.Stats right now, I have to do the following, and it is not straightforward to do it programmatically with functions like Matrix.mapiRow.

> let A = matrix [[1.0;2.0];[1.1;2.9];[nan; 1.5]];;
val A : Matrix<float> = matrix [[1.0; 2.0]
                                [1.1; 2.9]
                                [nan; 1.5]]

> let completeCov = Matrix.columnSampleCovarianceMatrixOf A;;
val completeCov : Matrix<float> = matrix [[nan; nan]
                                          [nan; 0.5033333333]]

> completeCov ;;
val it : Matrix<float> = matrix [[nan; nan]
                                 [nan; 0.5033333333]]

> // To get pairwise complete
- ;;
> let pairwiseCompleteA = Matrix.removeRowAt 2 A;;
val pairwiseCompleteA : Matrix<float> = matrix [[1.0; 2.0]
                                                [1.1; 2.9]]

> let pairwiseCov = Matrix.columnSampleCovarianceMatrixOf pairwiseCompleteA;;
val pairwiseCov : Matrix<float> = matrix [[0.005; 0.045]
                                          [0.045; 0.405]]

> pairwiseCov;;
val it : Matrix<float> = matrix [[0.005; 0.045]
                                 [0.045; 0.405]]

> pairwiseCov.Item (1,1) <- completeCov.Item (1,1);;
val it : unit = ()

> pairwiseCov;;
val it : Matrix<float> = matrix [[0.005; 0.045]
                                 [0.045; 0.5033333333]]
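
The manual steps above can be sketched programmatically. This is a minimal illustration in plain F# on arrays, assuming the data is available as a non-empty float[][] of rows; pairwiseCov and pairwiseCovMatrix are hypothetical helper names, not FSharp.Stats functions, and the sketch does not use the vector and matrix multiplications that the library's own implementation relies on.

```fsharp
// Hypothetical helper, not part of FSharp.Stats: sample covariance of two
// columns using only the rows where both values are present (not nan).
let pairwiseCov (xs: float[]) (ys: float[]) : float =
    let pairs =
        Array.zip xs ys
        |> Array.filter (fun (x, y) ->
            not (System.Double.IsNaN x || System.Double.IsNaN y))
    if pairs.Length < 2 then
        nan // fewer than two complete pairs: sample covariance is undefined
    else
        let meanX = pairs |> Array.averageBy fst
        let meanY = pairs |> Array.averageBy snd
        let sumProd =
            pairs |> Array.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
        sumProd / float (pairs.Length - 1)

// Hypothetical helper: pairwise-complete covariance matrix of the columns
// of a row-major table. Assumes at least one row and equal row lengths.
let pairwiseCovMatrix (rows: float[][]) : float[,] =
    let nCols = rows.[0].Length
    let cols = Array.init nCols (fun j -> rows |> Array.map (fun r -> r.[j]))
    Array2D.init nCols nCols (fun i j -> pairwiseCov cols.[i] cols.[j])

// Same data as the matrix A above.
let rowsOfA = [| [| 1.0; 2.0 |]; [| 1.1; 2.9 |]; [| nan; 1.5 |] |]
let pairwiseResult = pairwiseCovMatrix rowsOfA
```

Because each entry is computed from its own set of complete rows, the (1,1) entry uses all three observations of the second column, which is what the manual overwrite of pairwiseCov.Item (1,1) achieves in the session above.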

bvenn commented 3 years ago

Thanks for reporting this issue. I had a look at it and indeed there is no trivial answer to it. In general, we try to add specialized functions to handle nan values by adding a [functionname]NaN function. In this special case I agree that generating a covariance matrix from sparse data takes several steps that should be encapsulated in a special function. The current implementation is based on vector and matrix multiplications. The most transparent way for users to understand the workflow is the one you suggested. I'm going to have another look at it and try fixing it.

Thanks again for reporting.