davidavdav / NamedArrays.jl

Julia type that implements a drop-in replacement of Array with named dimensions

Names are dropped when converting a NamedArray to a DataFrame #76

Open arnaudmgh opened 5 years ago

arnaudmgh commented 5 years ago

I ran into a problem when writing the result of freqtable to a CSV file: I converted to DataFrame and lost all the names.

The solution I came up with was to overload CSV.write as follows:

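using CSV, DataFrames, NamedArrays

# Write a two-dimensional NamedArray to CSV, keeping the row names as a first column.
# Assumes DataFrames 1.x, where names! is no longer available.
function CSV.write(file::Union{String, IO}, named::NamedArray)
  rownames, colnames = names(named)                  # one vector of names per dimension
  df = DataFrame(Array(named), string.(colnames))    # values only, named columns
  insertcols!(df, 1, :row_names => rownames)         # prepend the row names
  CSV.write(file, df)
end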

I'd be willing to help, submit a PR, or contribute in some other way, depending on what is suggested.

Please let me know what would help and make sense. Thanks!

nalimilan commented 5 years ago

This is definitely not specific to FreqTables (implementing the method there would be type piracy), so it should either use Tables.jl or a special DataFrames constructor. Tables.jl doesn't support arrays, so that leaves DataFrames.

Though there's some tension with the way AbstractMatrix behaves: DataFrame(::AbstractMatrix) gives a data frame with the same dimensions as the input. Yet DataFrame(::NamedMatrix) would have an additional column giving the row names. That means NamedArray wouldn't completely work like other AbstractArray objects. A solution would be to have a keyword argument to add row names, which would be off by default.
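
As a rough illustration of the keyword-argument idea, here is a minimal sketch in plain Julia. The helper `matrix_columns` and its `rownames_col` keyword are hypothetical names made up for this example and are not part of NamedArrays.jl or DataFrames.jl; a plain matrix plus two name vectors stand in for a NamedMatrix. It returns a NamedTuple of column vectors, a form that `DataFrame(...)` accepts.

```julia
# Hypothetical helper: a plain matrix plus name vectors stand in for a NamedMatrix.
# With rownames_col = nothing (the default) the result has exactly size(A, 2)
# columns, like DataFrame(::AbstractMatrix); passing a Symbol adds a leading
# column that holds the row names.
function matrix_columns(A::AbstractMatrix, rownames, colnames; rownames_col=nothing)
    size(A) == (length(rownames), length(colnames)) ||
        throw(DimensionMismatch("names do not match the size of A"))
    colkeys = Symbol.(colnames)
    cols = Any[A[:, j] for j in axes(A, 2)]
    if rownames_col !== nothing
        colkeys = vcat(Symbol(rownames_col), colkeys)
        pushfirst!(cols, collect(rownames))
    end
    return NamedTuple{Tuple(colkeys)}(Tuple(cols))
end

A = [1 2; 3 4]
plain = matrix_columns(A, ["a", "b"], ["x", "y"])
named = matrix_columns(A, ["a", "b"], ["x", "y"]; rownames_col=:row_names)
```

Keeping the extra column off by default preserves the dimensions of the input, and the caller opts in to the row names explicitly.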

Another consideration is that a different conversion rule can be considered for higher-dimensional NamedArray objects: have one column per dimension and one row per cell. This is how it works for example in R if you call as.data.frame on a table object (but not on an R named array). This is useful in particular for frequency tables. Maybe we can find a different solution for that, though (something like stack).
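
The "one column per dimension, one row per cell" layout can be sketched in plain Julia for any number of dimensions. Again, `long_columns` is a hypothetical helper, not an existing API: `levels` and `dims` are assumed stand-ins for the names and dimension names a NamedArray carries, and the result is a NamedTuple of column vectors.

```julia
# Hypothetical helper: one column per dimension plus a value column, one row per cell.
# `A` is a plain array; `levels[d]` holds the names along dimension d and
# `dims[d]` is the name of that dimension.
function long_columns(A::AbstractArray, levels, dims; value=:value)
    all(length(levels[d]) == size(A, d) for d in 1:ndims(A)) ||
        throw(DimensionMismatch("names do not match the size of A"))
    idx = vec(collect(CartesianIndices(A)))   # column-major order, same as vec(A)
    cols = Any[[levels[d][I[d]] for I in idx] for d in 1:ndims(A)]
    push!(cols, vec(collect(A)))              # the cell values, in the same order
    colkeys = (Symbol.(dims)..., Symbol(value))
    return NamedTuple{colkeys}(Tuple(cols))
end

counts = [10 20; 30 40]
tbl = long_columns(counts, (["f", "m"], ["no", "yes"]), (:sex, :smoker); value=:n)
```

For the 2×2 example this gives four rows, one per cell, with the first dimension varying fastest.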

arnaudmgh commented 5 years ago

Thanks for the explanations and the good points @nalimilan. I agree the transformation of higher-dimensional arrays performed by R's as.data.frame looks very much like a stack operation.

So there is indeed some tension between the intuitive two-dimensional solution and higher-dimensional tables: the function I wrote above only handles the two-dimensional case and ignores higher dimensions.

One possibility would be to stack by default, even 2d arrays. A user can always unstack if necessary.

dietercastel commented 4 years ago

This should be solved with #99 for arbitrary dimensions.