SciNim / Datamancer

A dataframe library with a dplyr like API
https://scinim.github.io/Datamancer/datamancer.html
MIT License
133 stars 7 forks source link

Formulas raise type mismatch if condition contains variable with same name as key #65

Closed andreas-wilm closed 5 months ago

andreas-wilm commented 5 months ago

Hi there,

Many thanks for this super useful library!

I found a potential bug or at least unexpected behaviour in formulas. The problem happens when using a variable that has the same name as the key/column it's applied on.

Here's some example code:

import datamancer
let df = toDf({ "x" : @[1, 2, 3, 4, 5], "y" : @["a", "b", "c", "d", "e"] })

# works as described in docs: echo df.filter(f{ `x` < 3 or `y` == "e" })
# the below produces a type mismatch:

let y="e"
echo df.filter(f{ `x` < 3 or `y` == y })

This can be easily avoided by renaming the variable to something else, but it might be worthwhile pointing this out in the docs (which I realize are thin in formulas right now).

Thanks again, Andreas

Vindaar commented 5 months ago

Hey! Thanks for the kind words and the issue!

That is indeed just an oversight. I guess I never encountered the problem or forgot to investigate if I ever did. Previous to #66 we were just not adjusting the identifier that is associated to the Tensor as part of the formula loop body in the case where the user used accented quotes. If you wrote:

echo df.filter(f{ `x` < 3 or idx("y") == y })

instead for example, it would produce a yT variable for you instead. Now, we make sure to produce unique identifiers in all the cases.

And yes, sorry that the docs are still a bit short. You can find a bit more here:

https://scinim.github.io/getting-started/basics/data_wrangling.html

but I should probably write even more about formulas in particular.

Just while you're here, I would mention a couple of things:

echo df.filter(f{ idx("x", int) < 3 or idx("y", string) == y })

to be explicit about the type information (this is in addition to the type hints you can put at the beginning of the formula, i.e.

f{string -> int: <formula>}

for example. The latter is convenient, but of course not sufficient if you have multiple different types in an expression.

Hope those points help a bit. And feel free to reach out in case something else is unclear (or broken!)! :)

andreas-wilm commented 5 months ago

What an awesome answer. Thanks so much!