JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Append column of all one value without knowing length #1339

Closed aterenin closed 5 years ago

aterenin commented 6 years ago

The following code should work rather than throwing an exception.

df = DataFrame(A = 1:10, B = 11:20, C = rand(10))
df2 = DataFrame(df, D = 1)

The use case for this syntax can be seen from the following typical piece of R code.

d.p2 = rbind(
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-enron-100.csv"), dataset="Enron\nK=100"),
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-enron-1000.csv"), dataset="Enron\nK=1000"),
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-nyt-100.csv"), dataset="NYT\nK=100"),
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-nyt-1000.csv"), dataset="NYT\nK=1000"),
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-pubmed-100.csv"), dataset="PubMed\nK=100"),
  data.frame(read.csv("~/Git/Polya-Urn-LDA/experiments/logmargpost-pubmed-1000.csv"), dataset="PubMed\nK=1000")
)

This reads multiple CSVs into one data frame, with the id of each file appended as an extra variable. Doing this in one line matters: it keeps the code from becoming big and ugly, with many temporary variables polluting the namespace. The only way I know to do it in one line, without explicitly knowing the DataFrame's length, is the following.

df = DataFrame(A = 1:10, B = 11:20, C = rand(10))
df2 = df |> x -> begin x[:D] = 1; x end

This is arguably rather ugly. DataFrame(df, D = 1) is much better.
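
For the specific use case above (stacking several tables and tagging each row with its origin), a minimal sketch follows. It assumes a recent DataFrames.jl release (1.x), where `vcat` accepts a `source` keyword; the two small data frames and their column names are hypothetical stand-ins for the results of reading the CSV files.

```julia
using DataFrames

# Hypothetical stand-ins for the tables that reading each CSV would return.
enron = DataFrame(iteration = 1:3, logpost = [-10.0, -9.5, -9.1])
nyt   = DataFrame(iteration = 1:3, logpost = [-20.0, -19.2, -18.7])

# `source` adds a column identifying which input table each row came from,
# so the number of rows in each table never has to be known in advance.
combined = vcat(enron, nyt; source = :dataset => ["Enron", "NYT"])
```

This does not add a constant column to a single data frame, but it covers the `rbind` pattern without temporary variables.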

ararslan commented 6 years ago

This is already possible with an even simpler syntax than what's proposed:

julia> df = DataFrame(A = 1:10, B = 11:20, C = rand(10));

julia> df[:D] = 1
1

julia> df
10×4 DataFrames.DataFrame
│ Row │ A  │ B  │ C        │ D │
├─────┼────┼────┼──────────┼───┤
│ 1   │ 1  │ 11 │ 0.89186  │ 1 │
│ 2   │ 2  │ 12 │ 0.787124 │ 1 │
│ 3   │ 3  │ 13 │ 0.954328 │ 1 │
│ 4   │ 4  │ 14 │ 0.929557 │ 1 │
│ 5   │ 5  │ 15 │ 0.237096 │ 1 │
│ 6   │ 6  │ 16 │ 0.732097 │ 1 │
│ 7   │ 7  │ 17 │ 0.23767  │ 1 │
│ 8   │ 8  │ 18 │ 0.554835 │ 1 │
│ 9   │ 9  │ 19 │ 0.12011  │ 1 │
│ 10  │ 10 │ 20 │ 0.804556 │ 1 │

aterenin commented 6 years ago

That syntax is not functional. If I want to make it functional, I need to write the following.

df = DataFrame(A = 1:10, B = 11:20, C = rand(10))
df2 = df |> x -> begin x[:D] = 1; x end

The proposed enhancement would define functional syntax for precisely what you've mentioned.
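
As a point of comparison, here is a sketch of how a non-mutating version can be written, assuming a recent DataFrames.jl release (1.x). It relies on `insertcols!` repeating a scalar value for every row and on `copy` to leave the original untouched; this is one possible spelling, not necessarily the one the maintainers settled on for this issue.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = 11:20, C = rand(10))

# copy(df) leaves the original untouched; the scalar 1 is repeated for every row,
# so the length of df never has to be written out.
df2 = insertcols!(copy(df), :D => 1)

# In-place alternative using broadcast assignment to create a new column.
df[!, :E] .= 1
```

Because `insertcols!` returns the data frame, the first form also fits at the end of a pipe, for example `df |> x -> insertcols!(copy(x), :D => 1)`.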

nalimilan commented 6 years ago

I'm not sure we should support the same constructors as R, but this sounds like a legitimate task for hcat. Currently our hcat methods are kind of weird, since they do not allow choosing the name of the new column (it's created automatically). We should probably allow that, either via keyword arguments or using pairs. So your code would look like this:

hcat(CSV.read(...), dataset="XXX")
hcat(CSV.read(...), dataset => "XXX")

aterenin commented 6 years ago

Thanks for reopening the issue! That syntax with hcat is exactly what I am discussing, shown below with pipes.

CSV.read("hello") |> hcat(dataset = "hi")
CSV.read("hello") |> hcat(dataset => "hi")
CSV.read("hello") |> hcat(dataset = :hi)

Syntax, and its capacity to encourage users to write clean code, matters. The above is much cleaner than anything multiline, or than my inline syntax with begin/end, because it is more concise and there is no need to balance parentheses anywhere. I think hcat should be extended to allow it.

randyzwitch commented 6 years ago

I think allowing the single value case as above is reasonable, but please do not implement the generic case of recycling vectors as R does. Adding a constant as a label is one thing; having a vector of 7 recycle across a vector of 49 because it repeats evenly is horrid IMO.

nalimilan commented 6 years ago

I don't think anybody suggested recycling vectors. We've been very careful to only recycle scalars so far.
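
The distinction between scalars and vectors can be illustrated with the `DataFrame` constructor. This is a sketch assuming a recent DataFrames.jl release (1.x); the column names are made up for illustration, and the exact exception type is deliberately not relied on.

```julia
using DataFrames

# A scalar is repeated to match the length of the other column.
ok = DataFrame(A = 1:4, D = 1)

# A shorter vector is not recycled, even though 2 divides 4 evenly;
# the constructor is expected to throw instead.
threw = try
    DataFrame(A = 1:4, D = [1, 2])
    false
catch
    true
end
```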

nalimilan commented 5 years ago

Duplicate of https://github.com/JuliaData/DataFrames.jl/issues/659.