Vindaar / ggplotnim

A port of ggplot2 for Nim
https://vindaar.github.io/ggplotnim
MIT License

Getting the mean of a variable #47

Closed lf-araujo closed 4 years ago

lf-araujo commented 4 years ago

Hi,

Again, thanks for this amazing tool.

Is it possible to calculate the mean of a variable? For instance, in my toy data frame I have a variable named X. The code below fails:

import ggplotnim

let df = toDf(readCsv("./data/autoInsurSweden.csv"))
var a = df.select("X")
mean(a)

with

Error: type mismatch: got <DataFrame>
but expected one of: 
proc mean(v`gensym542432: PersistentVector[Value]): Value
  first type mismatch at position: 1
  required type for v`gensym542432: PersistentVector[formula.Value]
  but expression 'a' is of type: DataFrame

expression: mean(a)

Is it possible to get the mean of one of the variables within the df?

Thank you

Vindaar commented 4 years ago

You're welcome!

edit: sorry for the kinda weird line breaks. I wrote the answer in Org mode.

There aren't any convenience functions that operate on full data frames in such a way at the moment.

But there are 2 ways you can do this.

Apply a proc to a full column (a PersistentVector[Value])

You perform the calculation (e.g. mean) on a DF column, like so:

let mean = df["X"].mean
           ^--- get column "X", type `PersistentVector[Value]`
                   ^--- apply desired function of type 
                        `PersistentVector[Value] -> Value`.

The result is simply a Value (use e.g. toFloat to get a normal float from it).

The possible functions are those lifted here: https://github.com/Vindaar/ggplotnim/blob/master/src/ggplotnim/formula.nim#L1095

You can use the lift[Vector|Scalar][String|Int|Float]Proc templates to lift any other proc you like. The Vector templates are for normal procs with the signature proc [T](x: seq[T]): T (they map a sequence to a scalar), and the Scalar templates are for procs with the signature proc [T](x: T): T (i.e. a transformation of the value x). Note: in some cases you might want to lift a proc only locally. By default the templates produce exported procs, which are only allowed at top level. To lift a proc locally, use the toExport = false argument (it's a static bool!).

Use summarize to reduce the data frame

This is actually more in line with your question. You can apply a function such as the one above using the summarize proc:

echo df.summarize(f{"X_mean" ~ mean("X")})

where summarize takes one or several functions. The result will be a data frame that is reduced to a single row, because a Vector-like proc (in the sense described above) is applied. Since the result here is a full data frame, you can get the actual mean value like this:

let dfMean = df.summarize(f{"X_mean" ~ mean("X")})
echo dfMean["X_mean"][0] # since there's only 1 entry anyways

See the documentation (sorry, it's not great) here: https://vindaar.github.io/ggplotnim/formula.html#summarize%2CDataFrame%2Cvarargs%5BFormulaNode%5D

summarize is useful if you want to combine this with some other operation. group_by is special in that regard: if a grouped data frame is handed to summarize, the operation will be done for each group! So if you had a DF with a classification column "class" with elements {"A", "B", "C", "D"}:

echo df.group_by("class").summarize(f{"X_mean" ~ mean("X")})

the result would be the means of the 4 classes.
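To make the per-group behaviour concrete, here is a minimal sketch using only the Nim standard library. It is not how ggplotnim implements group_by or summarize; the groupMeans proc and the sample rows are hypothetical and only illustrate the idea of splitting the rows by class and reducing each group to its mean.

```nim
import std/[tables, stats]

proc groupMeans(rows: openArray[(string, float)]): OrderedTable[string, float] =
  ## Hypothetical helper: collect the X values of every class,
  ## then reduce each group to a single mean value.
  var groups = initOrderedTable[string, seq[float]]()
  for row in rows:
    # row[0] is the class label, row[1] is the X value
    groups.mgetOrPut(row[0], newSeq[float]()).add row[1]
  result = initOrderedTable[string, float]()
  for cls, xs in groups:
    result[cls] = mean(xs)   # mean comes from std/stats

# Made-up sample data: (class, X)
let rows = [("A", 1.0), ("A", 3.0), ("B", 10.0), ("B", 20.0), ("B", 30.0)]
let means = groupMeans(rows)
for cls, m in means:
  echo cls, ": ", m
```

Each class ends up as one entry in the result, which mirrors the one-row-per-group data frame described above.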

Let me know if this answers your question!

lf-araujo commented 4 years ago

This completely solves my issue. Thanks.

I think what you are doing is a great service to Nim already.

Also make sure to set up a support github link and a Brave BAT account.

lf-araujo commented 4 years ago

Here is a follow up question.

For:

echo df["X"].variance

I get:

[image: screenshot of the output, not preserved]

Variance (from the stats module) expects an openArray, which I can't lift, it seems.

Also, what lifting means?

Vindaar commented 4 years ago

Ah, you're right. That's an omission on my part. Indeed, I haven't lifted any of the procs from the stats module.

edit2: Oh, wow. I completely missed:

> which I can't lift,

I'll fix the lifting templates later today to work on openArray!

> what lifting means?

Lifting in this context just means taking a proc with the signature proc [T](s: seq[T]): T and turning it into one with the signature proc (v: PersistentVector[Value]): Value. For scalar procs it just means making the proc work on Value types. If the proc is a generic proc [T](s: T): T, lifting shouldn't even be required, but since many procs aren't generic, the Scalar templates are there too.
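The idea can be sketched with the standard library alone. The block below is an illustration under stated assumptions, not ggplotnim's actual code: the Value type, toValue, the liftVectorFloat template and the lifted names meanV and varianceV are all hypothetical stand-ins, and a plain seq[Value] takes the place of PersistentVector[Value].

```nim
import std/stats   # mean and variance over openArray[float]

type
  ValueKind = enum vkFloat, vkInt
  Value = object             # hypothetical stand-in for ggplotnim's Value
    case kind: ValueKind
    of vkFloat: fnum: float
    of vkInt: inum: int

proc toFloat(v: Value): float =
  ## Unwrap a Value into a plain float.
  case v.kind
  of vkFloat: v.fnum
  of vkInt: v.inum.float

proc toValue(x: float): Value =
  ## Wrap a plain float in a Value.
  Value(kind: vkFloat, fnum: x)

template liftVectorFloat(name, liftedName: untyped) =
  ## Hypothetical lifting template: generates `liftedName`, which unwraps
  ## the Values to floats, calls `name` on them and wraps the result again.
  proc liftedName(v: seq[Value]): Value =
    var floats = newSeq[float](v.len)
    for i, x in v:
      floats[i] = x.toFloat
    toValue(name(floats))

# Lift two procs from std/stats so they accept a column of Values.
liftVectorFloat(mean, meanV)
liftVectorFloat(variance, varianceV)

let col = @[toValue(1.0), toValue(2.0), toValue(3.0), toValue(4.0)]
echo meanV(col).toFloat
echo varianceV(col).toFloat
```

The sketch gives the lifted procs new names to keep overload resolution obvious; the real templates, as used in this thread, make the lifted proc callable under the original name (e.g. df["X"].variance).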

before edit:

To use it you have to lift it by putting the following at top level in your code:

liftVectorFloatProc(variance)

edit: I'll add those sometime later today to the default lifted procs.

> Could you please share a bitcoin address so I can donate? Also make sure to set up a support github link and a Brave BAT account.

That's very kind. I'll think about it!

Vindaar commented 4 years ago

Ok, finally on a computer to check this.

Lifting a proc that takes openArray works as expected. Maybe I misunderstood you and you meant it should be lifted automatically? In any case, I'll lift those by default now.

edit: ok, just pushed that change. The stats procs are now lifted by default. Once the CI passes, I'll push a new version.

edit2: new version with the changes is now tagged. The commit adding the lifted procs was: https://github.com/Vindaar/ggplotnim/commit/862c77ae693991970c2abc53ba1837c1f3d6b22c

lf-araujo commented 4 years ago

Thank you. Yes, something odd happened on my side. I believe I was getting an error from the interactive Nim shell I was using yesterday.