Getting the mean of a variable #47

lf-araujo commented 4 years ago


Is it possible to calculate the mean of a variable? For instance, in my toy data frame I have a variable named X. The below errs:

import ggplotnim

let df = toDf(readCsv("./data/autoInsurSweden.csv"))
var a ="X")


Error: type mismatch: got <DataFrame>
but expected one of: 
proc mean(v`gensym542432: PersistentVector[Value]): Value
  first type mismatch at position: 1
  required type for v`gensym542432: PersistentVector[formula.Value]
  but expression 'a' is of type: DataFrame

expression: mean(a)

Is it possible to get the mean of one of the variables within the df?

Vindaar commented 4 years ago

There aren't any convenience functions that operate on full data frames in such a way at the moment.

But there are 2 ways you can do this.

Apply a proc to a full column (a PersistentVector[Value])

You perform the calculation (e.g. mean) on a DF column, like so:

let mean = df["X"].mean
           ^--- get column "X", type `PersistentVector[Value]`
                   ^--- apply desired function of type 
                        `PersistentVector[Value] -> Value`.

The result is simply a Value (use e.g. toFloat to get a normal float from it).

The possible functions are those lifted here:

You can use the lift[Vector|Scalar][String|Int|Float]Proc templates to lift any other proc you like. The Vector templates are for normal procs with signature proc [T](x: seq[T]): T (i.e. they map a sequence to a scalar) and the Scalar templates are for procs with signature proc [T](x: T): T (i.e. a transformation of the value x). Note: in some cases you might want to lift a proc only locally. By default the templates produce exported procs, which are only allowed at top level. To lift a proc locally, use the toExport = false argument (it's a static bool!).

Use summarize to reduce the data frame

This is actually more in line with your question. You can apply a function such as the one above using the summarize proc:

echo df.summarize(f{"X_mean" ~ mean("X")})

where summarize takes one or more functions. The result will be a data frame reduced to a single row, because a Vector-like proc (in the sense described above) was applied. Since the result here is a full data frame, you can get the actual mean value like this:

let dfMean = df.summarize(f{"X_mean" ~ mean("X")})
echo dfMean["X_mean"][0] # since there's only 1 entry anyways

See the documentation (sorry, it's bad) here:

summarize is useful if you want to combine this with some other operation. group_by in particular is special in that regard: if a grouped data frame is handed to summarize, the operation will be done for each group! So if you had a DF with a classification column "class" with elements {"A", "B", "C", "D"}:

echo df.group_by("class").summarize(f{"X_mean" ~ mean("X")})

the result would be the means of the 4 classes.
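To make the per-group reduction concrete, here is a minimal sketch in plain Nim (standard library only, no ggplotnim) of what a grouped mean computes. The toy data and the names classes, xs and groups are made up for illustration, and this is not how ggplotnim implements group_by or summarize internally:

```nim
import std/[tables, stats]

# Hypothetical toy data standing in for the "class" and "X" columns.
let classes = @["A", "B", "A", "B"]
let xs = @[1.0, 10.0, 3.0, 20.0]

# Collect the X values belonging to each class.
var groups = initOrderedTable[string, seq[float]]()
for i in 0 ..< classes.len:
  groups.mgetOrPut(classes[i], @[]).add xs[i]

# Reduce each group to a single number: one result row per class.
for label, values in groups:
  echo label, " ", mean(values)
```

Each class ends up with exactly one mean, which is the shape of result you get back from the grouped summarize call.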

lf-araujo commented 4 years ago

Here is a follow-up question.


echo df["X"].variance

I get:


Variance (from the stats module) expects an openArray, which I can't lift, it seems.


Vindaar commented 4 years ago

Ah, you're right. That's an omission on my part. Indeed, I haven't lifted any of the procs from the stats module.

edit2: Oh, wow. I completely missed:

which I can't lift,

I'll fix the lifting templates later today to work on openArray!

what lifting means?

Lifting in this context just means taking a proc with signature proc [T](s: seq[T]): T and turning it into one with signature proc (v: PersistentVector[Value]): Value. For scalar procs it just means making the proc work on Value types. If the proc is generic, i.e. proc [T](s: T): T, lifting shouldn't even be required, but since many procs aren't generic, the Scalar templates are there too.
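As a rough illustration of the idea (not the actual ggplotnim templates), here is a self-contained sketch. The Value type, the toValue and toFloat helpers and the liftVector template are simplified, hypothetical stand-ins, and a plain seq[Value] is used in place of PersistentVector[Value]:

```nim
import std/stats

# Hypothetical stand-in for ggplotnim's `Value`; the real type is richer.
type
  Value = object
    fval: float

proc toValue(x: float): Value = Value(fval: x)
proc toFloat(v: Value): float = v.fval

# Hypothetical lifting template: given the name of a proc that maps a
# sequence of floats to a float, define an overload that maps
# seq[Value] to Value by unwrapping, calling the original, and re-wrapping.
template liftVector(procName: untyped) =
  proc procName(v: seq[Value]): Value =
    var unwrapped = newSeq[float](v.len)
    for i in 0 ..< v.len:
      unwrapped[i] = v[i].toFloat
    # Overload resolution picks the original proc for seq[float].
    result = toValue(procName(unwrapped))

# `variance` from std/stats takes an openArray; lift it to work on Values.
liftVector(variance)

let col = @[toValue(1.0), toValue(2.0), toValue(3.0), toValue(4.0)]
echo variance(col).toFloat
```

The lifted overload has the same name as the original, so the call site looks identical whether the column holds plain floats or Value elements.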

before edit:

To use it you have to lift it by putting the following at top level in your code:


edit: I'll add those sometime later today to the default lifted procs.

Vindaar commented 4 years ago

Ok, finally on a computer to check this.

Lifting a proc that takes openArray works as expected. Maybe I misunderstood you and you meant it should be lifted automatically? In any case, I'll lift those by default now.

edit: ok, just pushed that change. The stats procs are now lifted by default. Once the CI passes, I'll push a new version.

edit2: new version with the changes is now tagged. The commit adding the lifted procs was:

lf-araujo commented 4 years ago

Thank you. Yes, something odd happened on my side; I believe I was getting an error from the interactive Nim shell I was using yesterday.