TidierOrg / Tidier.jl

Meta-package for data analysis in Julia, modeled after the R tidyverse.
MIT License
515 stars 14 forks source link

Add support for `in` combined with `filter`? #35

Closed vcanogil closed 1 year ago

vcanogil commented 1 year ago
using Tidier
using RDatasets

movies = dataset("ggplot2", "movies")
theseyears = [1971, 1939]
@chain movies begin
    @filter(Year ∈ !!theseyears)
end

Throws an error:

ERROR: MethodError: no method matching iterate(::DataAPI.BroadcastedSelector{Cols{Tuple{Int64, Int64}}})

One workaround that seems to do the trick is the following:

@chain movies begin
    # @filter(Year ∈ !!theseyears)
    @filter(Year ∈ [[1971, 1939]])
end

Can anybody reproduce this?

kdpsingh commented 1 year ago

I would need to confirm, but I think what's happening here is that it's doing an element-wise comparison with [1979, 1939] because ∈ is getting auto-vectorized. This auto-vectorization is actually desirable but normally in Julia, you would need to do Ref([1979, 1939]) on the right side (see: https://www.juliabloggers.com/what-is-∈-in-julia/). Let me take a look.

I may want to impose some logic that detects and fixes this situation to make it work more like R. Will confirm and then propose some options.

bkamins commented 1 year ago

Ref and wrapping in [ ... ] have the same effect in this case.

kdpsingh commented 1 year ago

That makes sense. At the very minimum, I'll add an example to documentation for @filterand will include it here as well since this is a common situation.

kdpsingh commented 1 year ago

@vcanogil, this issue is partially fixed in #36.

The following examples should both now work (after the PR is merged).

@chain movies begin
    @filter(Year in (1971, 1939))
end
@chain movies begin
    @filter(Year in [1971, 1939])
end

Interpolation right now only works for variables. I can probably fix that and will address that in a separate PR.

The syntax shown above also works inside of @mutate(), for example:

@chain movies begin
    @mutate(Included_Year = Year in (1971, 1939))
end
kdpsingh commented 1 year ago

Figured out a fix that will allow interpolation of values. This will introduce one change in behavior:

References to column names will need to be provided as symbols, or as collections (tuples or vectors) of symbols.

Strings (and collections of strings) will now be treated as values rather than being converted into symbols so that you can interpolate numbers and strings.

Once documentation is updated, will put in a PR and will close this issue.

kdpsingh commented 1 year ago

Note to self: I fixed in but need to fix . Will fix this at the same time.

bkamins commented 1 year ago

you also need to remember about

kdpsingh commented 1 year ago

Thanks @bkamins! Will make sure to handle that as well.

kdpsingh commented 1 year ago

Ok this is fully fixed in #38. The documentation webpage has examples on how to use @filter() with in both normally and using interpolation. Closing issue, and pushing this as a patch release to the registry.