alex-hhh / data-frame

A data frame implementation for Racket
https://alex-hhh.github.io/2018/08/racket-data-frame-package.html
Other
37 stars 10 forks source link

least-squares: matrix-solve contract violation #14

Open bennn opened 1 year ago

bennn commented 1 year ago

What program did you run?

#lang racket

(require data-frame)

(define df
  (let* ((df (make-data-frame))
         (xs (make-series "x" #:data '#(0 1)))
         (ys (make-series "y" #:data '#(2 3))))
    (df-add-series! df xs)
    (df-add-series! df ys)
    df))

(df-least-squares-fit df "x" "y" #:mode 'polynomial)

What happened?

matrix-solve: contract violation
  expected: matrix-invertible?
  given: (array #[#[2 1 1] #[1 1 1] #[1 1 1]])
  argument position: 1st
  other arguments...:
   (array #[#[5] #[3] #[3]])

What did you expect to happen?

No error.

I'm not sure what the problem is. Too few points? I would rather have a poor fit (straight line) than an error. But if an error is the right thing, can we clear up the message?

alex-hhh commented 1 year ago

The problem is that you are trying to fit a second-degree polynomial with only 2 points. Depending on how you look at it, this either has no solution or has an infinite number of solutions :-)

You can either use a linear fit, or specify the degree of the polynomial (which defaults to 2)

(df-least-squares-fit df "x" "y" #:mode 'linear)
(df-least-squares-fit df "x" "y" #:mode 'polynomial #:polynomial-degree 1)

I agree, however that the error message could be improved.

bennn commented 1 year ago

Is the general rule that the degree can't be smaller than the number of points?

alex-hhh commented 1 year ago

Is the general rule that the degree can't be smaller than the number of points?

You need at least N + 1 data points to fit a polynomial of degree N. This is because for a N degree polynomial, there are N + 1 coefficients, so you need N + 1 equations, in the form y = P(x) to find these coefficients. This is easiest to understand for a first-degree polynomial which is a line:

Unfortunately, we cannot simply add a contract to check that the polynomial degree must be less than the number of points: we need N + 1 distinct data points. For example, you cannot find a "least squares fit" line for a data set that contains the 3 points (0, 0), (0, 0), (0, 0).

A contract for distinct points would not work either: Depending on how matrix solve algorithms are implemented in the racket/math library, it might be possible to have points which are different (i.e., not equal in the = or fl= sense), yet close to each other so that the matrix cannot be inverted, and a solution cannot be found.

bennn commented 1 year ago

Hmm, ok. What about catching exceptions from math and re-raising them with a brief message like "unable to solve, may be too few distinct points in the dataset" ?

alex-hhh commented 1 year ago

a brief message like "unable to solve, may be too few distinct points in the dataset" ?

Such a message might confuse the user, and I had too many bad experiences with such messages. If an exception is to be thrown, I would rather have the original exception, rather than a higher level routine's interpretation of it -- for example the exception might not be related to "solving" at all.

Here is a dataset with two clearly distinct data points that will fail to find a linear fit:

#lang racket

(require data-frame)

(define df
  (let* ((df (make-data-frame))
         (xs (make-series "x" #:data '#(0 "1")))
         (ys (make-series "y" #:data '#(2 "3"))))
    (df-add-series! df xs)
    (df-add-series! df ys)
    df))

(df-least-squares-fit df "x" "y" #:mode 'polynomial #:polynomial-degree 1)

While the above example is somewhat contrived, this situation can easily happen in real life: some programs export data as CSV files and sometimes they quote numbers inside these files. For such CSV files, numbers will he read as strings, and the user needs to pass #t to #:quoted-numbers? in df-read/csv to read them as numbers.

bennn commented 1 year ago

Oh, by re-raise I meant to keep the original exception message. In other words, throw the actual exception and add a commentary on top.