Closed: aterenin closed this issue 5 years ago
This is already possible with an even simpler syntax than what's proposed:
julia> df = DataFrame(A = 1:10, B = 11:20, C = rand(10));
julia> df[:D] = 1
1
julia> df
10×4 DataFrames.DataFrame
│ Row │ A  │ B  │ C        │ D │
├─────┼────┼────┼──────────┼───┤
│ 1   │ 1  │ 11 │ 0.89186  │ 1 │
│ 2   │ 2  │ 12 │ 0.787124 │ 1 │
│ 3   │ 3  │ 13 │ 0.954328 │ 1 │
│ 4   │ 4  │ 14 │ 0.929557 │ 1 │
│ 5   │ 5  │ 15 │ 0.237096 │ 1 │
│ 6   │ 6  │ 16 │ 0.732097 │ 1 │
│ 7   │ 7  │ 17 │ 0.23767  │ 1 │
│ 8   │ 8  │ 18 │ 0.554835 │ 1 │
│ 9   │ 9  │ 19 │ 0.12011  │ 1 │
│ 10  │ 10 │ 20 │ 0.804556 │ 1 │
That syntax is not functional. If I want to make it functional, I need to write the following.
df = DataFrame(A = 1:10, B = 11:20, C = rand(10))
df2 = df |> x -> begin x[:D] = 1; x end
The proposed enhancement would define functional syntax for precisely what you've mentioned.
I'm not sure we should support the same constructors as R, but this sounds like a legitimate task for hcat. Currently our hcat methods are kind of weird since they do not allow choosing the name of the new column (it's created automatically). We should probably allow that, either via keyword arguments or using pairs. So your code would look like this:
hcat(CSV.read(...), dataset="XXX")
hcat(CSV.read(...), dataset => "XXX")
Thanks for reopening the issue! That syntax with hcat is exactly what I am discussing, shown below with pipes.
CSV.read("hello") |> hcat(dataset = "hi")
CSV.read("hello") |> hcat(dataset => "hi")
CSV.read("hello") |> hcat(dataset = :hi)
Syntax, and its capacity to encourage users to write clean code, matters. The above is much cleaner than anything multiline, or than my inline syntax with begin/end, because it is more concise and there is no need to balance parentheses anywhere. I think hcat should be extended to allow it.
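For illustration, here is a minimal sketch of how a pipe-friendly, non-mutating form could behave. It assumes a recent DataFrames.jl release in which `df[!, name] = vector` adds a column. `with_column` is a hypothetical helper name invented for this example. It is not part of DataFrames.jl and not the hcat extension under discussion.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): return a copy of `df`
# with one extra column in which every row holds the scalar `value`.
function with_column(df::DataFrame, name::Symbol, value)
    out = copy(df)                          # leave the input untouched
    out[!, name] = fill(value, nrow(out))   # repeat the scalar once per row
    return out
end

# Curried method, so the helper can sit on the right-hand side of a pipe.
with_column(name::Symbol, value) = df -> with_column(df, name, value)

df = DataFrame(A = 1:10, B = 11:20, C = rand(10))
df2 = df |> with_column(:D, 1)   # df keeps three columns, df2 has four
```

The curried method is what makes the one-line pipe possible without a begin/end block. The same effect could be had with an anonymous function, at the cost of the extra parentheses this comment objects to.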
I think allowing the single value case as above is reasonable, but please do not implement the generic case of recycling vectors as R does. Adding a constant as a label is one thing; having a vector of 7 recycle across a vector of 49 because it repeats evenly is horrid IMO.
I don't think anybody suggested recycling vectors. We've been very careful to only recycle scalars so far.
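The scalar-only rule described in the two comments above can be sketched as follows. `expand_column` is a hypothetical function written for this illustration and is not how DataFrames.jl implements the check. It assumes that anything which is not an AbstractVector counts as a scalar.

```julia
# Hypothetical policy check (illustration only): a scalar is repeated to the
# table's length, while a vector must already have that length and is never
# recycled, even when the target length is an exact multiple of its own.
function expand_column(value, n::Integer)
    if value isa AbstractVector
        length(value) == n ||
            throw(DimensionMismatch("column has length $(length(value)), expected $n"))
        return value
    end
    return fill(value, n)
end

labels = expand_column("XXX", 3)   # scalar label: repeated for each of 3 rows

try
    expand_column(1:7, 49)         # 49 is a multiple of 7, but still rejected
catch err
    println("rejected: ", typeof(err))
end
```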
Duplicate of https://github.com/JuliaData/DataFrames.jl/issues/659.
The following code should work rather than throwing an exception.
The use case for this syntax can be seen from the following typical piece of R code.
This reads multiple CSVs into one data frame, with the id of each file appended as an extra variable. The only way I know to do this in one line, as is needed to avoid code becoming big and ugly with many temporary variables polluting the namespace, and without explicitly knowing the DataFrame's length, is by writing the following.
This is arguably rather ugly.
DataFrame(df, D = 1) is much better.