Rewrite formula macro implementation

Vindaar commented 3 years ago

After starting with this months ago and getting stuck, I finally picked it up again yesterday.

This is a major rewrite of most parts of the formula macro. It's still complicated, but at least it should be a bit more robust. The automatic type deductions are quite a bit more advanced.

There are more tests required, but this should do for now.

Unfortunately I feel like compilation times have become slower. Maybe that's due to changing from regular for loops to forEach for mem copyable types? Otherwise the generated code should be mostly the same as before.

Vindaar commented 3 years ago

This is currently broken in the implementation:

let df = seqsToDf({ "A" : concat(newSeqWith(50, "a"), newSeqWith(50, "b")),
                    "B" : toSeq(0 ..< 100) })
echo df.group_by("A").summarize(f{sum(`B`)})

FormulaNode(name: "(sum B)", valName: "(sum B)", kind: fkScalar,
            valKind: toValKind(T), fnS: proc (df: DataFrame): Value =
  let B = df["B", Tensor[T]]
  var res = sum(B)
  result = %~res)
/tmp/group_by_filter_sum.nim(5, 34) template/generic instantiation of `{}` from here
/home/basti/CastData/ExternCode/ggplotnim/src/ggplotnim/dataframe/formula.nim(158, 12) Error: undeclared identifier: 'T'

The problem is of course our introduced allowed types: https://github.com/Vindaar/ggplotnim/blob/fixFormulaImpl/src/ggplotnim/dataframe/formulaExp.nim#L95-L96 which include the generic types for better handling.

We need to error with "cannot determine type at CT" in these cases (since we apparently don't have heuristic information either, but: TODO check that we check for that { there is no such information in sum(...) }).

Should be enough to have that in the loop in which we already do the type assignment (combining heuristic, type hints and symbol types), which still lives here: https://github.com/Vindaar/ggplotnim/blob/fixFormulaImpl/src/ggplotnim/dataframe/formula.nim#L947-L971

At the same time that part should be refactored out.

edit: This has been fixed in: https://github.com/Vindaar/ggplotnim/pull/115/commits/77b3748055eb484bfc3aa31762d49eb0d068781c It now errors out at CT with a message to give type hints.

Vindaar commented 3 years ago

Not sure if it's actually a regression or not, but another thing discovered from working on that gist:

echo df.group_by("A").mutate(f{float: "meanB" ~ mean(`B`)})

the mutate call here (and probably in all cases now), doesn't actually use the ~ to mark it as a mapping operation. What this means is that the following:

echo df.group_by("A").mutate(f{float: "meanB" << mean(`B`)})

compiles and gives the exact same result. This should error out somehow?

Ahh, I just realized what's going on: It's the inverse of what I assumed. mutate is handling this case correctly. It sees a reducing procedure, but "doesn't care" about that. It simply assigns the same value for each row in the DF. That makes sense, because calling a mapping operation with a constant is fully valid. Maybe we just want to assign constant values (of each group!) to each row.

The actual wrong thing is that the formula f{"meanB" ~ mean(`B`)} should error at CT with a message that the formula body is reducing but ~ implies a mapping operation (which it clearly isn't!).

edit: this has been fixed in https://github.com/Vindaar/ggplotnim/pull/115/commits/a6f8f8ed27d2e320c3513c20d60b3ef6525e4d6a It now errors at CT with a message to change the given formula kind (the symbol used).

edit2: It's not completely obvious we should always error at CT in case of such a mismatch. Type information doesn't tell the whole story. It is possible we want to use the result of a reducing operation and assign to every element! Consider something like this from our test suite:

    let df = toDf(readCsv("data/mpg.csv"))
      .group_by("class")
      # for simplicity, we're gonna add the mean of each group to
      .mutate(f{float -> float: "subMeanHwy" ~ 0.0 + mean(df["hwy"])})
      .arrange("class")

Adding the mean to the constant zero here may be not the best example, but this may be some constant offset we have to compute. And in the context of a group_by call, it is perfectly reasonable to want to work with the mean of that subgroup.

We'll replace these errors by CT warnings for the time being. This is of course only sensible in one direction: scalar to tensor, but in case of tensor ↦ scalar does not make sense. The CT error for that one stays.

Vindaar commented 3 years ago

Our formulaNameMacro went one step too far. We shouldn't have changed it to convert every Nim expression into a compile time string, because that makes it impossible to assign a name to a column using some runtime information like for instance in ggplotnim:

          let fn = f{float: colStr ~ ms.trans( df[col.toStr][idx] ) }

where colStr is determined at runtime of course.

So buildFormula needs to return a NimNode after all (and should be named buildName instead).

edit: this has been fixed in: https://github.com/Vindaar/ggplotnim/pull/115/commits/bfa2fa118cc885166ca351a5a280e3e88fd4f23d

Vindaar / ggplotnim

Rewrite formula macro implementation #115