TidierOrg / Tidier.jl

Meta-package for data analysis in Julia, modeled after the R tidyverse.
MIT License
524 stars 14 forks source link

Columns containing spaces in names #47

Closed kdpsingh closed 1 year ago

kdpsingh commented 1 year ago

We need to test out whether the package works with column names containing spaces. I think Symbol("column name") may work in some contexts but likely not all. Depending on what we find, we may want to think about how to handle cases where it doesn't work, and what to communicate to users in the docs about what does and does not work.

Another longer-term goal to think about implementing a function similar to R's janitor::clean_names(), which converts "illegal" column names into legal ones.

bkamins commented 1 year ago

I dplyr is it

`a b`

right?

I guess you could use the same in Tidier.jl. It would take advantage of the fact that this is a valid cmd in Julia. It would just need to be documented that in Tidier.jl commands non-standard evaluation of commands happens. You could parse it from:

julia> dump(:(`a b`))
Expr
  head: Symbol macrocall
  args: Array{Any}((3,))
    1: GlobalRef
      mod: Module Core
      name: Symbol @cmd
      binding: Ptr{Nothing} @0x0000000000000000
    2: LineNumberNode
      line: Int64 1
      file: Symbol REPL[6]
    3: String "a b"

and you can see that the third arg is the string that would be interpreted as a column name.

I guess this approach would create least friction for R users.

kdpsingh commented 1 year ago

Thanks @bkamins! I was actually trying this approach and having some difficulty in getting MacroTools to detect this AST pattern using @capture. Probably just need to spend more time with it, but I agree this would be the ideal way of handling from an R user standpoint.

bkamins commented 1 year ago

It is just as if you called:

julia> dump(:(@cmd "a b"))
Expr
  head: Symbol macrocall
  args: Array{Any}((3,))
    1: Symbol @cmd
    2: LineNumberNode
      line: Int64 1
      file: Symbol REPL[28]
    3: String "a b"

The only difference is that here @cmd is searched in current scope, while backtick notation guarantees that the lookup is in Core module.

kdpsingh commented 1 year ago

This is really helpful! Will play with this and hope to get this working.

kdpsingh commented 1 year ago

Thanks again, @bkamins!

I think I have this issue figured out.

As-is, using var"column name" works throughout the package.

In PR #61, we now support backticks, which converts the string into var"string".

Before merging, I'm going to:

kdpsingh commented 1 year ago

Documentation added, this is now resolved in #61. Closing this issue. Will open a janitor::clean_names() functionality request as a separate issue.